Introduction

Ovarian tumours are of various histological types, including benign, borderline, and malignant lesions [1,2,3]. Benign tumours have a good prognosis and are treated conservatively with regular follow-up [2, 4]. Epithelial hyperplasia and nuclear atypia are more prominent in borderline ovarian tumours (BOTs) than in benign ovarian tumours; however, BOTs have no stromal invasion, unlike ovarian malignancies [5]. BOTs have a good prognosis, with a 10-year survival rate of > 95% for stages I, II, and III [6]. The primary treatment for BOTs is surgical intervention; however, more than one-third of BOT cases occur in women aged under 40 years who may want to conceive in the future [1]. Therefore, prioritising fertility preservation in young women who wish to have children is crucial. Patients with malignant ovarian tumours should be referred to gynaecologic oncologists for further diagnosis and treatment, and depending on the stage of cancer, debulking surgery and chemotherapy may be considered [7]. Different types of ovarian tumours have distinct clinical and pathological characteristics, treatment strategies, and prognoses. The early detection and treatment of ovarian malignancies can improve patient outcomes [8]. Therefore, the preoperative identification of the nature of ovarian tumours is critical for patients and can guide physicians in developing individualised and precise management plans.

Ultrasonography, especially transvaginal, is considered the primary method for evaluating adnexal tumours [9, 10]. Currently, subjective assessment by ultrasound (US) experts is a relatively good method of distinguishing the nature of ovarian tumours. However, US specialists are few, and differences in subjective diagnoses among US physicians with different experience levels exist [11, 12]. Therefore, objectively and quantitatively analysing the various imaging features that may reveal the potential biological characteristics of tumours in a reproducible manner is necessary.

Radiomics is an emerging field of quantitative imaging that can significantly impact personalised medicine. It mines quantitative features from medical images using high-throughput methods; these features are then transformed into objective, structured data through complex algorithms and applied in clinical decision support systems to improve diagnosis, prognosis assessment, and prediction accuracy [13, 14]. Previous studies on computed tomography (CT)/magnetic resonance imaging (MRI)/US-based radiomics for differentiating benign and malignant ovarian tumours achieved satisfactory diagnostic results [15,16,17,18]. However, radiomic features are predefined, including morphology, intensity, texture, and wavelet features, which are superficial and low-order, and cannot represent the heterogeneity of the entire tumour [19, 20]. Therefore, to accurately classify ovarian tumours, studying their deeper- and higher-level features is necessary.

Deep learning (DL) is a branch of machine learning (ML) that allows computational models with multiple processing layers to learn data representations at numerous abstraction levels [21]. The convolutional neural network (CNN) is the most commonly used DL architecture in medical image analysis [22]. CNN-extracted features have been suggested to capture various high-order image features that can be applied to specific clinical outcomes [20]. Successful application of DL requires large training sets; however, medical data sets are often limited in size. Many practical applications therefore use CNNs pre-trained on ImageNet, an approach known as transfer learning (TL), instead of training DL models from scratch [23, 24]. Research using deep transfer learning (DTL) to classify benign and malignant ovarian tumours has been successful [11, 12, 25]. However, in those studies BOTs were categorised as malignant ovarian tumours for statistical analysis. Combining DL classification networks with traditional hand-crafted radiomics frameworks is a new development [26, 27]. Few reports exist on US-based combined DL radiomics (DLR) models as multi-classification prediction models for classifying ovarian tumours as benign, borderline, or malignant. We hypothesised that DLR could differentiate between benign, borderline, and malignant ovarian tumours. Hence, this study aimed to develop an US-based DLR model to identify benign, borderline, and malignant ovarian lesions.

Materials and methods

Study design and participants

We enrolled 849 patients with ovarian tumours confirmed by histopathological examination after surgical removal from July 2014 to October 2022. The inclusion criteria were: (a) complete US examination within 1 month before surgery and (b) a clear and definite US image of the target lesion. The exclusion criteria were: (a) poor image quality, (b) absent or incomplete US and clinical data, (c) pregnancy, (d) history of tumours in other parts of the body and ovarian metastatic cancer, (e) previous treatment before US examination or surgery, and (f) pathological diagnosis obtained through biopsy and uncertain pathology results. A flowchart of the participants is shown in Fig. 1.

Fig. 1

Inclusion and exclusion criteria for patients with ovarian tumours for the training and testing sets. Abbreviation: BOTs = borderline ovarian tumours

The study population was labelled according to the pathological results, with benign ovarian tumours labelled as “class 0”, BOT as “class 1”, and malignant ovarian tumours as “class 2”. Participant data were randomised into training and testing sets in a ratio of 8:2 using Python’s statistical package. The random partitioning was stratified to handle the class imbalance; hence, the proportions of patients with benign, borderline, and malignant ovarian tumours in the total study population, training set, and testing set were similar. No data overlap occurred between the training and testing sets, avoiding the repeated use of data from the same patient [28].

The training set was used to learn the parameters and build the model, whereas the testing set was used to evaluate the generalisability of the selected model and prevent overfitting.
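As an illustration of the stratified 8:2 partition described above, the following is a minimal sketch that assumes scikit-learn's `train_test_split`; the text only mentions "Python's statistical package", so the library, the `random_state`, and the placeholder patient identifiers are assumptions rather than the study's actual code. The class counts are those reported in the Results.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Class counts reported in the Results: benign (0), borderline (1), malignant (2).
labels = [0] * 549 + [1] * 55 + [2] * 245
patient_ids = list(range(len(labels)))  # hypothetical identifiers, one per patient

# Stratified 8:2 split; random_state is an arbitrary choice for reproducibility.
train_ids, test_ids, y_train, y_test = train_test_split(
    patient_ids, labels, test_size=0.2, stratify=labels, random_state=0
)


def class_proportions(y):
    """Return the fraction of samples in each class."""
    counts = Counter(y)
    return {c: counts[c] / len(y) for c in sorted(counts)}


print("all:  ", class_proportions(labels))
print("train:", class_proportions(y_train))
print("test: ", class_proportions(y_test))
```

Because each patient identifier appears once, the two sets cannot share a patient, and stratification keeps the three class proportions close to those of the whole cohort.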

Collecting clinical parameters

Preoperative clinical data of all patients, including age, menopausal status, height, weight, body mass index (BMI), carbohydrate antigen 125 (CA125), red blood cell count (RBC), white blood cell count (WBC), neutrophil count (N), lymphocyte count (L), monocyte count (M), platelet count (PLT), and haemoglobin, were obtained from the patients’ electronic medical records. BMI and some inflammation-related risk factors, such as the neutrophil-to-lymphocyte ratio (NLR), derived neutrophil-to-lymphocyte ratio (dNLR), platelet-to-lymphocyte ratio (PLR), lymphocyte-to-monocyte ratio (LMR), and systemic immune-inflammation index (SII), were calculated using the following simple formulas:

$$ BMI=\frac{weight\left(kg\right)}{{height}^{2}\left({m}^{2}\right)}$$
$$ NLR=\frac{N\left({10}^{9}\right)}{L\left({10}^{9}\right)}$$
$$ dNLR=\frac{N\left({10}^{9}\right)}{(WBC-N)\left({10}^{9}\right)}$$
$$ PLR=\frac{PLT\left({10}^{9}\right)}{L\left({10}^{9}\right)}$$
$$ LMR=\frac{L\left({10}^{9}\right)}{M\left({10}^{9}\right)}$$
$$ SII=\frac{N\left({10}^{9}\right)\times PLT\left({10}^{9}\right)}{L\left({10}^{9}\right)}$$
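The formulas above translate directly into code. The sketch below is illustrative only: the function name, argument names, and example values are hypothetical, and blood counts are assumed to be given in units of 10^9 per litre.

```python
def inflammatory_indices(weight_kg, height_m, wbc, neutrophils,
                         lymphocytes, monocytes, platelets):
    """Compute BMI and the inflammation-related ratios defined above.

    All blood counts are assumed to be in units of 10^9 per litre.
    """
    return {
        "BMI": weight_kg / height_m ** 2,
        "NLR": neutrophils / lymphocytes,
        "dNLR": neutrophils / (wbc - neutrophils),
        "PLR": platelets / lymphocytes,
        "LMR": lymphocytes / monocytes,
        "SII": neutrophils * platelets / lymphocytes,
    }


# Hypothetical patient values, chosen only to make the arithmetic easy to follow.
example = inflammatory_indices(
    weight_kg=60.0, height_m=1.60, wbc=6.0,
    neutrophils=4.0, lymphocytes=1.6, monocytes=0.4, platelets=240.0,
)
print(example)
```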

Ultrasound data acquisition

All participants underwent transvaginal ultrasonography whenever possible. If a mass was too large to be fully displayed on transvaginal ultrasonography, the examination was supplemented with transabdominal US. Transrectal or transabdominal ultrasonography could be performed if a patient was unsuitable for transvaginal ultrasonography. The following US equipment was used in the study: GE Voluson E10 and GE Voluson E8 (GE Healthcare; GE Medical Systems, Zipf, Austria) and Mindray Resona R9 (Mindray Bio-Medical Electronics Co., Ltd., China), with RIC5-9-D and V11-3HU transvaginal US probes and C1-5-D and SC6-1U abdominal US probes. Recorded US semantic features included: maximum diameter of the lesion (≤ 50, 50–100, and ≥ 100 mm), characteristics of the mass (cystic, cystic-solid mixed, solid), colour Doppler score (1, no blood flow signal; 2, low blood flow signal; 3, moderate blood flow signal; 4, rich blood flow signal), laterality of the mass (unilateral or bilateral), and ascites (present or absent). If a patient had more than one ovarian mass, we selected the mass with the most complex morphology or the largest for further assessment [12, 29, 30].

The specialised assessment of ultrasound images

Initially, each ultrasound image was assessed by Doctor A, a gynaecology and obstetrics ultrasound specialist with ten years of professional experience, who provided the initial diagnosis. Subsequently, Doctor B, another gynaecology and obstetrics ultrasound expert with over 15 years of experience, confirmed the diagnosis. In cases of discordant opinions, a senior expert in gynaecology and obstetrics ultrasound with more than two decades of experience was consulted, and a consensus was reached through joint discussion. These doctors were unaware of the patients’ clinical and biochemical indicators and pathological results.

Image pre-processing and regions of interest (ROI) segmentation

The grey-level ranges of two-dimensional images obtained using different US devices vary significantly, and the voxel spacing of images obtained using different US devices is typically different. To address these problems, we employed a fixed-resolution resampling method.

The US images were imported into the ITK-SNAP 3.8.0 software (http://www.itksnap.org) for manual ROI segmentation. Segmentation of all ROIs was completed by A (an US expert with > 10 years of experience) and confirmed by B (an US expert with > 15 years of experience). When there were differences in opinion, a senior physician (an US expert with > 20 years of experience) was consulted for joint decision-making. To ensure the robustness and repeatability of the extracted radiomics features, two weeks later we randomly selected 50 US images from the dataset; A re-delineated the ROIs, and C (an US expert with 12 years of experience) independently delineated them at the same time. All the US experts were blinded to the clinical and pathological results of the study population.

For DTL, the slice of the US image with the largest tumour area was cropped to represent each patient. The grey values were normalised to the range [-1, 1] using a min-max transformation. Then, each cropped subregion of the US image was resized to 224 × 224 by nearest-neighbour interpolation and saved as a “.png” file to meet the input requirements of a CNN model.
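The two pre-processing steps above (min-max scaling to [-1, 1] and nearest-neighbour resizing to 224 × 224) can be sketched with NumPy alone. This is an assumption-laden illustration, not the study's code: the helper names are hypothetical, the input is a synthetic array standing in for a cropped greyscale US image, and file saving is omitted.

```python
import numpy as np


def min_max_to_unit_range(img):
    """Linearly rescale grey values so the minimum maps to -1 and the maximum to 1."""
    img = img.astype(np.float32)
    lo, hi = img.min(), img.max()
    if hi == lo:  # constant image: avoid division by zero
        return np.zeros_like(img)
    return 2.0 * (img - lo) / (hi - lo) - 1.0


def resize_nearest(img, out_h=224, out_w=224):
    """Resize by nearest-neighbour sampling: each output pixel copies one input pixel."""
    in_h, in_w = img.shape[:2]
    rows = np.minimum((np.arange(out_h) * in_h / out_h).astype(int), in_h - 1)
    cols = np.minimum((np.arange(out_w) * in_w / out_w).astype(int), in_w - 1)
    return img[rows[:, None], cols]


# Synthetic stand-in for a cropped greyscale US image (300 x 400 pixels).
crop = np.arange(300 * 400, dtype=np.uint32).reshape(300, 400)
cnn_input = resize_nearest(min_max_to_unit_range(crop))
print(cnn_input.shape, cnn_input.min(), cnn_input.max())
```

Nearest-neighbour sampling never creates new grey values, so the resized image stays within the normalised range.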

Hand-crafted radiomics feature extraction and selection

We employed PyRadiomics (http://pyradiomics.readthedocs.io) to extract the handcrafted radiomic features. Subsequently, Z-score normalisation was performed to eliminate differences in the value scales of the extracted features.

A total of 1476 handcrafted radiomics features were extracted from tasks 1 and 2, including first-order features, shape features, and features from the grey-level dependence matrix (GLDM), grey-level size zone matrix (GLSZM), grey-level run length matrix (GLRLM), and grey-level co-occurrence matrix (GLCM). The number and proportion of handcrafted radiomics features are presented in Fig. 2. The P-values for all handcrafted features are shown in Fig. 3.

Fig. 2

The proportion of hand-crafted radiomics features. Abbreviation: GLDM = grey-level dependence matrices, GLSZM = grey-level size zone matrices, GLRLM = grey-level run length matrices, GLCM = grey-level co-occurrence matrices

Fig. 3

P-values of all hand-crafted radiomics features. Abbreviation: GLDM = grey-level dependence matrices, GLSZM = grey-level size zone matrices, GLRLM = grey-level run length matrices, GLCM = grey-level co-occurrence matrices

First, we retained hand-crafted radiomic features with an intra-/inter-class correlation coefficient > 0.8 to ensure the robustness and repeatability of these features. Next, only the 1,444 features with P < 0.05 in a t-test or Mann–Whitney U-test were retained. Subsequently, Spearman correlation analysis was used to calculate the correlation between features. When the correlation coefficient between any two features exceeded 0.9, only one of them was retained; using a greedy recursive deletion strategy to maintain the features strongly correlated with the predicted target, 295 features were retained. Finally, the least absolute shrinkage and selection operator (LASSO) regression algorithm was used for feature selection. Depending on the regularisation weight λ, LASSO shrinks all regression coefficients towards zero and sets the coefficients of the irrelevant features precisely to zero. We employed 10-fold cross-validation with the minimum criterion to determine the optimal λ, where the final value of λ (0.016768) yielded the minimum cross-validation error. We retained 53 non-zero-coefficient features as optimal features.
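The correlation filter and LASSO step can be sketched as follows. This is a simplified illustration under stated assumptions: it uses SciPy's `spearmanr` and scikit-learn's `LassoCV` (the text does not name the libraries), runs on synthetic data rather than the study's features, and the greedy rule shown here (repeatedly drop the feature with the most highly correlated partners) is one plausible reading of "greedy recursive deletion", not necessarily the authors' exact rule. The helper `drop_correlated` is hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LassoCV


def drop_correlated(X, threshold=0.9):
    """Greedily drop features until no kept pair has |Spearman rho| > threshold."""
    rho, _ = spearmanr(X)  # feature-by-feature correlation matrix
    high = np.abs(rho) > threshold
    np.fill_diagonal(high, False)
    keep = np.ones(X.shape[1], dtype=bool)
    while True:
        # For each feature, count the still-kept features it is highly correlated with.
        partners = (high & keep[:, None] & keep[None, :]).sum(axis=0)
        if partners.max() == 0:
            break
        keep[np.argmax(partners)] = False  # remove the most redundant feature
    return np.flatnonzero(keep)


# Synthetic stand-in for the radiomics feature matrix (not study data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X = np.column_stack([X, X[:, 0] + 0.01 * rng.normal(size=200)])  # near-duplicate of feature 0
y = (X[:, 0] > 0).astype(float) + (X[:, 1] > 0)  # toy labels 0, 1, 2

X = (X - X.mean(axis=0)) / X.std(axis=0)  # Z-score normalisation
kept = drop_correlated(X, threshold=0.9)

# 10-fold cross-validation picks the lambda (alpha_) with minimum CV error.
lasso = LassoCV(cv=10, random_state=0).fit(X[:, kept], y)
selected = kept[lasso.coef_ != 0]  # features with non-zero coefficients
print("lambda:", lasso.alpha_, "selected features:", selected)
```

In scikit-learn the regularisation weight λ is called `alpha`; `LassoCV.alpha_` holds the value with the lowest mean cross-validation error, which corresponds to the minimum criterion described above.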

The deep transfer learning procedure

We used DTL, a CNN model pre-trained on the ImageNet dataset, to avoid overfitting owing to the limited size of the training dataset.

Data augmentation is often required to improve DTL’s prediction performance and generalisation ability in image classification because of imbalanced or insufficient data. Hence, we utilised horizontal flipping and random cropping for data augmentation, which helped increase the sample size and enhance the model performance.

To improve generalisation, we set the learning rate carefully. In this study, we adopted a cosine-decay learning rate schedule. The learning rates are presented in Additional file 1.
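For readers unfamiliar with cosine decay, a common form of the schedule is sketched below. The function name and the values of `lr_max`, `lr_min`, and `total_steps` are hypothetical placeholders; the settings actually used are those given in Additional file 1 and are not reproduced here.

```python
import math


def cosine_decay_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate that falls from lr_max to lr_min along half a cosine wave."""
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings, for illustration only.
schedule = [cosine_decay_lr(s, total_steps=100, lr_max=0.01) for s in range(101)]
print(schedule[0], schedule[50], schedule[100])
```

The rate changes slowly at the start and end of training and fastest in the middle, which is the usual motivation for preferring it over a step-wise drop.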

Signature building

The baseline clinical data were analysed in the training set. Clinical parameters and US semantic features with P < 0.05 were selected, and Spearman correlation analysis was used to assess the correlation between these parameters. Parameters without a significant correlation were input into a support vector machine model to build the clinical signature (Clinic_Sig).

After the LASSO regression feature screening, the optimal features were input into the Light Gradient Boosting Machine (LightGBM) model to construct the radiomic signature (Rad_Sig).

After the US image of the largest section of the mass was input into a ResNet50 model, the prediction probability of each sample was used as the deep transfer learning signature (DTL_Sig). Gradient-weighted class activation mapping (Grad-CAM) was applied to visualise the internal workings of the network and explain the decision basis of the CNN model.

We fused the prediction results of Rad_Sig, DTL_Sig, and Clinic_Sig for each sample as new features, input them into a Gradient Boosting model, and constructed a combined model on the training set, namely the deep learning radiomic signature (DLR_Sig).
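The decision-level fusion described above can be illustrated as follows. This sketch assumes scikit-learn's `GradientBoostingClassifier` as the "Gradient Boosting model" and assumes each signature contributes three class probabilities per sample; both are assumptions, as are the hyperparameters. The probabilities are synthetic and generated by a hypothetical helper, so the output says nothing about the study's results.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)


def synthetic_signature(y, sharpness):
    """Toy three-class probabilities that favour the true class (not study data)."""
    alpha = np.ones((len(y), 3))
    alpha[np.arange(len(y)), y] += sharpness
    return np.vstack([rng.dirichlet(a) for a in alpha])


y = rng.integers(0, 3, size=300)  # toy labels: 0 benign, 1 borderline, 2 malignant
clinic_prob = synthetic_signature(y, sharpness=1.0)  # stand-in for Clinic_Sig output
rad_prob = synthetic_signature(y, sharpness=2.0)     # stand-in for Rad_Sig output
dtl_prob = synthetic_signature(y, sharpness=4.0)     # stand-in for DTL_Sig output

# Decision-level fusion: the predicted probabilities become the new feature vector.
fused = np.hstack([clinic_prob, rad_prob, dtl_prob])

train, test = slice(0, 240), slice(240, 300)
model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(fused[train], y[train])
dlr_prob = model.predict_proba(fused[test])  # fused three-class probabilities
print(dlr_prob[:3])
```

Fusing at the decision level keeps the combined model's input small (nine values per sample here), in contrast to feature-level fusion, which concatenates all radiomic and deep features.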

Model assessment

In this study, we employed a one-versus-rest method, which is often applied in multiclass classification. We evaluated the model’s performance based on receiver operating characteristic (ROC) curves and the area under the ROC curve (AUC). We used precision, recall, and F1 score, together with their macro, micro, and weighted averages, to assess the one-versus-rest discrimination for each class of ovarian tumour and for the cohort as a whole. A confusion matrix was used to analyse the errors in the model.
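The one-versus-rest evaluation can be sketched with scikit-learn's metric functions; the library choice is an assumption, and the labels and probabilities below are synthetic, so the printed numbers are not the study's results. Each class is scored against the other two combined, the macro average is the unweighted mean of the class-specific AUCs, and the micro average pools all class-versus-rest decisions before computing a single AUC.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_recall_fscore_support,
                             roc_auc_score)
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(0)
classes = [0, 1, 2]  # benign, borderline, malignant

# Synthetic, imbalanced labels and probabilities (not study data).
y_true = rng.choice(classes, size=200, p=[0.65, 0.06, 0.29])
alpha = np.ones((200, 3))
alpha[np.arange(200), y_true] += 3.0
y_prob = np.vstack([rng.dirichlet(a) for a in alpha])
y_pred = y_prob.argmax(axis=1)

# One-versus-rest: binarise the labels so each column is "class k vs. the rest".
y_bin = label_binarize(y_true, classes=classes)
per_class_auc = [roc_auc_score(y_bin[:, k], y_prob[:, k]) for k in classes]
macro_auc = roc_auc_score(y_bin, y_prob, average="macro")
micro_auc = roc_auc_score(y_bin, y_prob, average="micro")

precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=classes, zero_division=0
)
cm = confusion_matrix(y_true, y_pred, labels=classes)  # rows: true, columns: predicted

print("class-specific AUC:", per_class_auc)
print("macro AUC:", macro_auc, "micro AUC:", micro_auc)
print("recall per class:", recall)
print(cm)
```

The per-class recall equals the diagonal of the confusion matrix divided by its row sums, which is the "proportion correctly identified" reported for each tumour class.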

Statistical analysis

Statistical analysis was performed using Python (https://www.python.org/). Normally distributed variables are reported as mean ± standard deviation, whereas non-normally distributed variables are reported as median (interquartile range). Categorical variables are expressed as frequencies (percentages). One-way analysis of variance was used to compare the three groups when the normality and homogeneity-of-variance assumptions were met; otherwise, a rank-sum nonparametric test for multiple independent samples was used. Categorical data were analysed using the chi-square (χ2) test. A two-sided P < 0.05 was considered statistically significant.

Results

Patient characteristics

We included 849 patients in this study. Among them, 549 (64.66%), 55 (6.48%), and 245 (28.86%) had benign, borderline, and malignant ovarian tumours, respectively. The proportions of benign, borderline, and malignant ovarian tumours in the entire study group, training set, and testing set were approximately the same. The baseline characteristics are shown in Table 1.

Table 1 Training and testing sets of clinical parameters and semantic features of ultrasound

Ultrasound expert assessment of benign, borderline, and malignant ovarian tumours

Ultrasound specialists demonstrated a high level of accuracy in distinguishing between benign and malignant ovarian tumours, with rates of 95.80% and 82.80%, respectively. Conversely, the accuracy in identifying borderline ovarian tumours was notably lower at 34.50% (Table 2).

Table 2 Expert assessment of the ultrasound images

The confusion matrix of the three-class classification prediction model

We used the confusion matrix to understand where each classifier made classification errors and in what proportions (Fig. 4; Table 3). The multiclass classification prediction models had a high rate of correctly distinguishing benign ovarian tumours (89.91%, 88.99%, 86.24%, and 82.57%, respectively). Clinic_Sig and Rad_Sig showed relatively poor accuracy in identifying malignant ovarian tumours (16.33% and 38.78%, respectively) and could not recognise BOT. The proportion of BOT correctly identified was highest for DLR_Sig, at 54.55%.

Fig. 4

Confusion matrix of three-class classification results based on the test set. (4a) Clinic_Sig; (4b) Rad_Sig; (4c) DTL_Sig; (4d) DLR_Sig. Class 0: benign ovarian tumours; class 1: BOT; class 2: malignant ovarian tumours. LightGBM, Light Gradient Boosting Machine

Table 3 The error analysis of the three-class classification prediction model

Classification performance

The DLR_Sig three-class prediction model had the best overall and class-specific classification performance, with micro- and macro-average AUCs of 0.90 and 0.84 on the testing set, respectively. The class-specific AUCs were 0.84 for benign, 0.85 for borderline, and 0.83 for malignant ovarian tumours (Fig. 5; Table 4).

Fig. 5

Three-class (one-vs-rest) ROC of the test set. (5a) Clinic_Sig; (5b) Rad_Sig; (5c) DTL_Sig; (5d) DLR_Sig. Class 0: benign ovarian tumours; class 1: BOT; class 2: malignant ovarian tumours. Micro- and macro-average ROC indicated the overall distinguishing ability of the three-class classification. LightGBM, Light Gradient Boosting Machine

Table 4 Overall and class-specific classification performance

Application of Grad-CAM

Grad-CAM, which can produce a coarse localisation map highlighting the critical regions for classification targets, is proposed as a method for visualising the decisions of CNN models. The red areas of the heat map are crucial references for model decision-making [31]. The site of concern for US diagnosis is consistent with the area of concern for CNN decision making (Fig. 6).

Fig. 6

The ResNet50 model with Grad-CAM was used on ovarian tumour patients. (6a–c) A solid hypoechoic mass in one patient’s pelvis, 100 mm in diameter. (6a) US image; (6b) Grad-CAM; the red area is the basis of decision-making for ResNet50; (6c) Histopathological results: theca cell tumour (40x). (6d–f) A cystic-solid mixed mass in a patient’s pelvis, 112 mm in diameter. (6d) US image; (6e) Grad-CAM; the red area is the basis of decision-making for ResNet50; (6f) Histopathological results: borderline ovarian tumour (40x)

In Fig. 6a, b, and c, there was a solid hypoechoic mass in one patient’s pelvis, 100 mm in diameter. Rich blood flow signals were observed in and around the mass, and the CA125 level was 8.67 U/ml. An US expert suggested that the patient had a malignant ovarian tumour. However, DTL_Sig predicted a benign lesion with a probability of 97.35%. Pathological results showed that it was a benign theca cell tumour. The prediction of DTL_Sig was consistent with the pathological diagnosis.

In Fig. 6d, e, and f, there was a cystic-solid mixed mass in a patient’s pelvis, which was 112 mm in diameter, and the CA125 level was 206 U/ml. An US expert suggested that the patient had a malignant ovarian tumour. However, DTL_Sig indicated a BOT with a probability of 85.98%. A pathological diagnosis of BOT was made. The prediction of DTL_Sig was consistent with the pathological diagnosis.

Discussion

The accurate prediction of the category of ovarian tumours is critical for patient-centred care. Studies using DLR for the multiclass classification of ovarian tumours are relatively scarce. In this study, we constructed four multiclass prediction models to classify benign, borderline, and malignant ovarian tumours. We found that the DLR prediction model had the best ability to classify ovarian tumours and to generalise to the testing set.

Ultrasonography is the primary method for screening ovarian tumours. Serum tumour markers are essential for discovering and treating ovarian cancer, and CA125 is the most important biomarker for evaluating ovarian cancer [32]. Inflammation is vital in the development and progression of ovarian cancer [33]. Therefore, we collected US semantic features, serum tumour markers, and related inflammatory factors from the study population. These US semantic features and clinical parameters are typically obtained during routine examinations and do not add additional burden to the patient. We selected some semantic elements, serum tumour markers, and related inflammatory factors to construct Clinic_Sig. The Clinic_Sig three-class prediction model had poor overall and class-specific classification performance and could not predict BOT; the precision, recall, and F1 scores were all zero.

US examination is subjective. US experts have higher diagnostic accuracy than less experienced doctors; however, US experts are few [11]. Recently, radiomics has become a powerful new method for quantifying features from medical images, including potential pathophysiological information of reference cancer tissues [34]. Some studies have used MRI/CT/US-based radiomics to differentiate between benign and malignant ovarian tumours with high diagnostic performance [15, 35, 36]. However, these studies did not mention the classification of BOT. Qi et al. [16] established and validated US-based radiomics models to discriminate between benign, borderline, and malignant serous ovarian tumours and provided preoperative diagnostic information to differentiate the nature of ovarian tumours. However, this was a binary classification study. In our research, the Rad_Sig three-class prediction model could not predict BOT, and the precision, recall, and F1 scores were all zero.

DL is becoming increasingly essential for image pattern recognition [21]. Considering the limited scale of medical datasets, we used TL instead of training a DL model from scratch. TL is beneficial because it improves the performance of a model built on small samples by utilising the knowledge learned in similar classification tasks [28]. Gao et al. [25] and Christiansen et al. [11] developed DTL models to identify benign and malignant ovarian tumours, with performance equivalent to the diagnostic level of an US specialist. Chen et al. [12] developed DTL algorithms to distinguish malignant from benign ovarian tumours, with performance comparable to expert subjective and ovarian adnexal reporting and data system assessments. However, they classified BOT as malignant ovarian tumours for statistical analysis. We used a ResNet50 model pre-trained on ImageNet [11, 37]. The DTL_Sig three-class prediction model had good overall and class-specific classification performance, with micro- and macro-average AUCs of 0.89 and 0.85 on the test set, respectively. The class-specific AUCs were 0.87 for benign, 0.82 for borderline, and 0.84 for malignant ovarian tumours. Although DTL performs well in various classification prediction tasks, it is a black-box algorithm that lacks interpretability, which restricts its application [31, 38]. Grad-CAM is employed as a method of depicting the decision-making of DL. In our study, as shown in Fig. 6, the site of concern for US experts making the diagnosis was consistent with the area of concern for CNN decision-making using Grad-CAM, and the DTL_Sig predictions were consistent with the pathological diagnosis results.

The combination of traditional hand-crafted radiomics and DTL algorithms, namely DLR, can effectively improve the accuracy and reliability of model predictions. It is currently a popular topic in ML for tumour research. Many studies [20, 38,39,40] show that the DLR model has a better prediction efficacy than Rad_Sig or DTL_Sig alone. Data from traditional radiomics and DTL can be fused at the feature level or the decision level, and feature-level fusion often leads to overfitting because of the large number of features [38]. We constructed a combined model on the training set by fusing the predicted probabilities of Clinic_Sig, Rad_Sig, and DTL_Sig for each sample. The combined three-class prediction model, DLR_Sig, had the best overall and class-specific classification performance, with micro- and macro-average AUCs of 0.90 and 0.84 on the testing set, respectively. The class-specific AUCs were 0.84 for benign, 0.85 for borderline, and 0.83 for malignant ovarian tumours. The combined model also performed best in predicting BOT, with the highest class-specific AUC, precision, recall, F1 score, and accuracy of 0.85, 42.86%, 54.55%, 57.14%, and 93.31%, respectively. The proportion of BOTs correctly identified by DLR_Sig (54.55%) exceeded that achieved by the ultrasound experts (34.50%).

This study had limitations. First, this was a retrospective single-centre study with a small sample size. Larger prospective and multicentre studies are required to evaluate the applicability of predictive models in clinical practice. Second, owing to the strict inclusion and exclusion criteria for data in this study, bias could have been introduced in the model’s training. Third, in this study, we extracted features from two-dimensional US images. In future studies, we will include other modalities such as colour Doppler flow imaging, spectral Doppler imaging, and contrast-enhanced US to provide more predictive information. Lastly, ROI delineation and cropping of the largest section of the tumour represented only one slice of the lesion and could not describe the heterogeneity of the entire tumour. In the future, we plan to store dynamic images of the whole tumour and input them into ML models to obtain more comprehensive information.

Conclusion

We developed a combined multiclass classification model that integrated clinical and traditional radiomics with DTL decision-level information to discriminate the nature of ovarian tumours. The performance and generalisation of this model support its feasibility for distinguishing between benign, borderline, and malignant ovarian tumours.