A machine learning approach applied to gynecological ultrasound to predict progression-free survival in ovarian cancer patients

In a growing number of social and clinical scenarios, machine learning (ML) is emerging as a promising tool for implementing complex multi-parametric decision-making algorithms. Regarding ovarian cancer (OC), despite the standardization of features that can support the discrimination of ovarian masses into benign and malignant, there is a lack of accurate predictive modeling based on ultrasound (US) examination for progression-free survival (PFS). This retrospective observational study analyzed patients with epithelial ovarian cancer (EOC) who were followed in a tertiary center from 2018 to 2019. Demographic features, clinical characteristics, information about the surgery and post-surgery histopathology were collected. Additionally, we recorded data about US examinations according to the International Ovarian Tumor Analysis (IOTA) classification. Our study aimed to realize a tool to predict 12 month PFS in patients with OC based on a ML algorithm applied to gynecological ultrasound assessment. Proper feature selection was used to determine an attribute core set. Three different machine learning algorithms, namely Logistic Regression (LR), Random Forest (RFF), and K-nearest neighbors (KNN), were then trained and validated with five-fold cross-validation to predict 12 month PFS. Our analysis included n. 64 patients and 12 month PFS was achieved by 46/64 patients (71.9%). The attribute core set used to train machine learning algorithms included age, menopause, CA-125 value, histotype, FIGO stage and US characteristics, such as major lesion diameter, side, echogenicity, color score, major solid component diameter, presence of carcinosis. RFF showed the best performance (accuracy 93.7%, precision 90%, recall 90%, area under receiver operating characteristic curve (AUROC) 0.92). We developed an accurate ML model to predict 12 month PFS.


Ovarian cancer
Ovarian cancer (OC) is the seventh-most-diagnosed cancer among women worldwide and the second-most-common gynecological malignancy. It accounted for approximately 14,000 deaths in the United States in 2020 [1].
Up to 90% of ovarian cancers are epithelial ovarian cancer (EOC) types. OC has multiple cellular origins [2]. The term tubo-ovarian cancer is often used because OC can arise as an ovarian or fallopian-tube mass or primary peritoneal cancer [3].
Type I tumors (low-grade serous, mucinous, endometrioid, and clear cell) occurring in the ovary are less aggressive and are therefore more easily diagnosed at an early stage because they tend to grow slowly. Type II tumors (high-grade serous carcinomas (HGSC), undifferentiated carcinomas, and carcinosarcomas) may originate from the tubal and/or ovarian surface epithelium, and are more aggressive [4][5][6].
The absence of proper screening and diagnostic procedures to detect OC at an early stage, as well as the rapid spread of disease through the peritoneal surface, are leading factors in OC lethality [7,8]. Currently, there is no accurate protocol to identify high-risk patients.
Therefore, identifying tools for accurate screening and early diagnosis and prognosis of OC represents a currently unmet clinical need.
In addition, the role of ultrasound (US) in OC is evolving. US is a cheap, non-invasive and well-recognized imaging modality for diagnosis and evaluation of OC [9].
The International Ovarian Tumor Analysis (IOTA) group established a standardized lexicon that includes all appropriate descriptors and definitions of the sonographic appearance characteristic of normal ovaries and ovarian lesions. To simplify the sonographer's assessment in differentiating benign from malignant adnexal masses, they also developed the Simple Rules classification system and the Assessment of Different Neoplasia in the Adnexa (ADNEX) model [10][11][12][13][14][15][16]. The Society of Radiologists in Ultrasound consensus statement [17,18] and the Gynecologic Imaging Reporting and Data System, also known as GI-RADS [19], are other proposed systems for the characterization and management of ovarian masses (OM) [20].
In 2018, the Ovarian-Adnexal Reporting and Data System (O-RADS) created a risk stratification classification for consistent follow-up and management in clinical practice [21].
However, a simple description of the tumor and of its extension may not be sufficient. The application of precision medicine could help answer questions about early response to treatment, the best timing for surgery, prognosis, or molecularly targeted drugs.

Machine Learning
In a growing number of social and clinical scenarios, machine learning (ML) is emerging as a promising tool for the implementation of complex multi-parametric decision-making algorithms [22,23]. In that sense, a ML approach is a potential game changer [24]. In fact, in addition to detecting linear patterns in analyzed data, it can unravel complex non-linear relationships between patient attributes that cannot be solved by traditional statistical methods, merging them to produce a prediction or a probability for a given outcome [22,25,26].
ML is a step toward precision medicine, leading to improved patient profiling and personalized treatment. Supervised ML algorithms have been shown to be effective in predicting treatment responses and disease progression in patients affected with heterogeneous diseases [27,28].
Regarding OC, despite the standardization of features that can support the discrimination of ovarian masses into benign and malignant, there is a lack of accurate predictive modeling based on US examination for PFS.

Materials and methods
In this retrospective observational study, we analyzed consecutive patients with EOC who were followed in a tertiary center from 2018 to 2019.
Demographic features (age), clinical characteristics (parity, menopause, CA-125 value, genetic mutation state, treatment) were collected as well as information about surgery (surgical procedures, residual tumor) and post-surgery histopathology (histotypes, grading, FIGO stage). Additionally, we recorded data about transvaginal and/or transabdominal US examinations according to IOTA classification (unilateral lesion, side, largest diameter of lesion, type of tumor, echogenicity of cyst fluid in tumors, color score, diameter of largest solid component, shadows, ascites, carcinosis, subjective assessment).
Our study aimed to develop a tool to predict 12 month PFS in patients with OC based on a ML algorithm applied to gynecological ultrasound assessment.
In total, the original database included n. 64 patients and n. 22 variables.
Appropriate feature selection was used to determine an attribute core set (see Supplementary Materials for further details).
The ML algorithms were aimed at forecasting PFS at 12 month follow-up.
Student's t test for paired samples or the Wilcoxon matched-pairs signed-rank test was used, as appropriate, to identify differences in continuous variables between different observation periods. McNemar's test was used to identify differences in dummy variables between observation periods.
The attribute core set used to train the algorithms was determined using a recursive feature elimination (RFE) wrapper based on a decision tree algorithm with extreme gradient boosting (XGBoost) [31]; in brief, this algorithm automatically selects from all the recorded attributes (n. 23) the optimal number of features based on their importance for predicting the given outcome (PFS at 12 months). Feature selection can counteract overfitting problems and improve classification performance. The RFE method is one of the most commonly used feature selection methods for small-sample problems [32][33][34] (for further details about RFE see Supplementary Materials).
The entire analysis was implemented in a Python 3.6 environment using the scikit-learn (ver. 0.22.1) and XGBoost (ver. 1.1.0) libraries [31,35]. After z-score normalization, we performed a Bayesian ridge conditional ridge imputation [36] for missing data. The latter method proved to be the most accurate method of imputation for obstetrics and gynecology datasets [37] (see Supplementary Materials for further details).
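The z-score normalization step can be illustrated with a minimal sketch in plain Python. It assumes that missing values are encoded as `None` and are skipped when the mean and standard deviation are computed, so that they remain available for later imputation. The function name `zscore` and the CA-125 values are hypothetical and are not taken from the study dataset; the population standard deviation is used here as an assumption, since the manuscript does not state which form was applied.

```python
from statistics import fmean, pstdev

def zscore(values):
    """Standardize a column to mean 0 and unit variance, leaving None in place."""
    observed = [v for v in values if v is not None]
    mu = fmean(observed)
    sigma = pstdev(observed)  # population standard deviation (assumption)
    if sigma == 0:
        # A constant column carries no information; map observed values to 0.
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - mu) / sigma for v in values]

# Hypothetical CA-125 values (U/mL); None marks a missing measurement.
ca125 = [35.0, 120.0, None, 560.0, 85.0]
scaled = zscore(ca125)
```

After this step the missing entry is still `None`, so an imputation model can be fitted on the standardized columns.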
Three different classifiers, both linear and non-linear, were trained and cross-validated with five-fold cross-validation using the core set of attributes recovered from the RFE to predict 12 month PFS.
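As a sketch of how five-fold cross-validation partitions a cohort of this size, the following plain-Python generator splits 64 sample indices into five folds. The helper name `k_fold_indices`, the shuffling and the random seed are assumptions made for illustration; the manuscript does not state whether the folds were shuffled or stratified.

```python
import random

def k_fold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Distribute the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(64, n_splits=5))
fold_sizes = [len(test) for _, test in folds]
```

Each patient appears in exactly one test fold, so every prediction used for performance estimation comes from a model that did not see that patient during training.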
Although logistic regression (LR) has almost always been the algorithm of choice for finding independent predictors in multivariate models, it should be noted that study hypotheses have usually been based on the unrealistic assumption that the association between prognostic factors and clinical outcomes is direct and isolated. Moreover, LR is not suitable for modeling non-independent variables. For this reason, along with the usual LR for linear modeling, we used the non-parametric K-nearest neighbors (KNN) and random forest (RFF) [36] algorithms. The latter models have recently been shown to accurately predict important outcomes for women's health, even in the presence of non-linear patterns in data [38][39][40]. Furthermore, we chose RFF because there is evidence of accurate performance in the case of unbalanced data, which is often the case for clinical datasets [41]. We also ran RFF using cost-sensitive training (using the argument class_weight = "balanced" in scikit-learn) to try to overcome the unbalanced class issue.
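To make the cost-sensitive option concrete, the sketch below reproduces in plain Python the weighting rule that scikit-learn documents for the "balanced" mode, namely n_samples / (n_classes × class count). The helper name `balanced_class_weights` is hypothetical; the class counts are those reported for this cohort (46 of 64 patients reaching 12 month PFS).

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency:
    n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# 1 = progression-free at 12 months (46 patients), 0 = progression (18 patients).
labels = [1] * 46 + [0] * 18
weights = balanced_class_weights(labels)
```

With these counts the minority class receives a weight of about 1.78 and the majority class about 0.70, so that both classes contribute equally to the training loss in aggregate.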
A repeated grid-search with cross-validation was used for optimal hyperparameter tuning to maximize the classifiers' performance [42] (See Supplementary Material for hyperparameter fine-tuning).
For each classifier, we plotted ROC curves, and the area under the receiver operating characteristic curve (AUROC) was then determined.
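The AUROC can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. A minimal sketch of this rank-based (Mann-Whitney) formulation is given below; the function name `auroc`, the labels and the scores are hypothetical illustrations and do not reproduce the study data or the library routine used by the authors.

```python
def auroc(y_true, y_score):
    """AUROC as the fraction of positive/negative pairs ranked correctly,
    counting ties as one half (Mann-Whitney U statistic)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = progression-free at 12 months) and predicted probabilities.
y_true = [1, 1, 1, 0, 1, 0, 0, 1]
y_score = [0.9, 0.8, 0.4, 0.35, 0.7, 0.2, 0.6, 0.55]
auc = auroc(y_true, y_score)
```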
Then, based on the optimal probability cut-off (Youden's index) [43], the classifiers' performance was compared using metrics including accuracy, precision and recall. In general, a classification model forecasts a binary outcome for a given observation and class. In the process of predicting, a model may output the probability of an observation belonging to each possible class. This allows some flexibility in the way predictions are interpreted and presented, allowing the choice of a threshold, such as the aforementioned Youden's index [44].
For a model to be reliable, the estimated class probabilities should reflect the true underlying probability of the sample. To check this assumption, a diagnostic calibration curve for the candidate best classifier was also plotted [44].
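The choice of a probability cut-off by Youden's index (J = sensitivity + specificity − 1) and the subsequent computation of threshold-dependent metrics can be sketched as follows. All function names and the example data are hypothetical; the sketch assumes that both classes are present and that a case is predicted positive when its score is greater than or equal to the threshold.

```python
def confusion(y_true, y_score, threshold):
    """Count TP, FP, TN, FN when predicting positive for score >= threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, y_score):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def youden_threshold(y_true, y_score):
    """Return the observed score that maximizes J = sensitivity + specificity - 1."""
    best_j, best_t = -1.0, None
    for t in sorted(set(y_score)):
        tp, fp, tn, fn = confusion(y_true, y_score, t)
        j = tp / (tp + fn) + tn / (tn + fp) - 1.0
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j

def metrics(y_true, y_score, threshold):
    """Accuracy, precision and recall at a given probability cut-off."""
    tp, fp, tn, fn = confusion(y_true, y_score, threshold)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical labels and predicted probabilities.
y_true = [1, 1, 1, 0, 1, 0, 0, 1]
y_score = [0.9, 0.8, 0.4, 0.35, 0.7, 0.2, 0.6, 0.55]
cutoff, j_stat = youden_threshold(y_true, y_score)
report = metrics(y_true, y_score, cutoff)
```

In this toy example the selected cut-off is 0.4, which differs from the default 0.5 and illustrates why the threshold is worth tuning on unbalanced data.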
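A calibration (reliability) curve of this kind can be obtained by grouping predictions into probability bins and comparing, within each bin, the mean predicted probability with the observed event frequency. The following is a minimal sketch under the assumption of equal-width bins; the function name `calibration_curve`, the number of bins and the data are hypothetical and are not taken from the study.

```python
def calibration_curve(y_true, y_prob, n_bins=5):
    """Return (mean predicted probability, observed frequency, count)
    for each non-empty equal-width bin of the predicted probabilities."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        # Clamp the index so that p == 1.0 falls in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((y, p))
    curve = []
    for members in bins:
        if members:
            mean_pred = sum(p for _, p in members) / len(members)
            observed = sum(y for y, _ in members) / len(members)
            curve.append((mean_pred, observed, len(members)))
    return curve

# Hypothetical outcomes and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 1, 1, 1]
y_prob = [0.1, 0.15, 0.3, 0.35, 0.7, 0.75, 0.9, 1.0]
curve = calibration_curve(y_true, y_prob, n_bins=5)
```

A well-calibrated model yields points close to the diagonal, that is, a mean predicted probability close to the observed frequency in every bin.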
The study was conducted in accordance with the Declaration of Helsinki, and the protocol was approved by the Scientific Board of the University of Bari, Bari, Italy. All patients signed a consent form for the use of their data for scientific purposes.

Results
Our analysis included n. 64 patients with a diagnosis of EOC. Demographic and clinical characteristics, information about surgical procedures, post-surgery histopathology and US features are outlined in Table 1.
The attribute core set used to train the machine learning algorithms is reported in Fig. 1. RFF showed an accuracy of 0.93 and an AUROC of 0.92.
The final dataset had a dimensionality of 64 rows × 12 columns (n. 11 selected attributes plus n. 1 target class, PFS at 12 months, as mentioned above).
Fig. 7 reports the ROC curves for the RFF (box A), LR (box B) and KNN (box C) models.
In Fig. 8, the calibration diagnostic is plotted for RFF; PFS occurred with an observed relative frequency roughly consistent with the forecast value, showing an acceptable calibration curve. We would expect the match between predicted frequencies and observed frequencies to increase with a larger dataset.
We also reported the odds ratios for the LR model for the interpretation of core set covariate associations in Table 3.
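For readers less familiar with the interpretation of LR output, an odds ratio is the exponential of a fitted coefficient, and a Wald-type confidence interval is obtained by exponentiating the coefficient plus or minus a multiple of its standard error. The sketch below illustrates this relationship; the function name `odds_ratio_ci` and the coefficient and standard error values are hypothetical, and the manuscript does not state which interval method was used for Table 3.

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio exp(beta) with a Wald interval exp(beta -/+ z * se)."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# Hypothetical coefficient and standard error for a single covariate.
or_, lower, upper = odds_ratio_ci(beta=-0.5, se=0.2)
```

An odds ratio below 1 whose interval excludes 1, as in this toy case, indicates a negative association between the covariate and the outcome.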

Discussion
The keystone of survival analyses in cancer research has historically been the Cox proportional hazards regression model, which serves as a surrogate for estimating treatment efficacy and safety. This model is based on the assumption of linear association. However, many clinicopathologic features show a non-linear association in medicine [45].
The ML approach has recently brought an unprecedented growth of applications to medical imaging.
In the study of OC, since 1999, artificial neural networks [46,47] have been applied to classify US images into benign and malignant, but image features were manually measured and provided by the investigators.
In 2015, Kazendar et al. [48] developed a fully automatic ML classifier stratifying US images as benign or malignant. Recently, due to the wide availability of digital medical images and the technical advances in hardware and software, ML has also been applied in conjunction with radiomic analysis.
In a study by Chiappa et al. [49], ML and radiomics were applied to transvaginal ultrasonography (TUS) to implement a decision support system (DSS) for predicting the risk level of malignancy of OM.
The DSS was based on a set of three radiomic ML models, named as solid masses, cystic masses and mixed masses. These radiomic models were integrated with information about presence/absence of acoustic shadows and serum CA-125 level, considering two different thresholds according to menopausal status. The DSS was based on TUS imaging and serum CA-125 level and showed 91% accuracy, 100% sensitivity, and 80% specificity in independent tests.
Martinez-Mas et al. [50] developed ML algorithms aimed at performing the automatic categorization of OC from US images. They analyzed 348 images. For each patient case and US image, the input features were first extracted using Fourier descriptors calculated over the Region Of Interest (ROI). Then, four ML algorithms were considered to perform the classification stage: KNN, Linear Discriminant (LD), Support Vector Machine (SVM) and Extreme Learning Machine (ELM). LD, SVM and ELM reported more than 85% accuracy.
The authors of [51] aimed to develop ML models predicting platinum sensitivity in patients with HGSC. Using the stepwise selection method, based on the AUC values, six variables associated with platinum sensitivity were selected: age, initial serum CA-125 levels, neoadjuvant chemotherapy, pelvic lymph node status, pelvic tissue involvement other than uterus and tubes, and small bowel and mesentery involvement. Based on these variables, predictive models were constructed using four ML algorithms: LR, RFF, SVM and deep neural network. Evaluation of model performance using the five-fold cross-validation method identified the LR-based model as the best for identifying platinum-resistant cases. Therefore, they developed a web-based nomogram adapting the LR model results for clinical utility.
Also attempting to improve treatment choices of OC patients, Shannon et al. [52] developed a ML tool to identify predictive molecular markers for cisplatin chemosensitivity.
CYTH3, GALNT3, S100A14, and ERI1 were the four potential biomarkers identified. Validation was performed on a cohort of n. 50 patients who underwent surgery followed by adjuvant carboplatin. Predictive models were established to predict chemosensitivity. The four biomarkers were also evaluated for their ability to prognosticate overall survival (OS) in three OC microarray expression datasets from The Gene Expression Omnibus. The extreme gradient boosting (XGBoost) algorithm was selected for the final model to validate the accuracy in an independent validation dataset (n = 10). CYTH3 and S100A14, followed by nodal stage, were the most important features. The four-gene signature had prognostic performance comparable to that of clinical information for two-year survival.
To date, only a few studies have attempted to apply ML to ultrasound evaluation of adnexal masses to predict benign or malignant histology.
On the other hand, some authors applied ML using only clinical and laboratory data to predict treatment response. To the best of our knowledge, this is the first ML algorithm based on clinical, surgical, histopathological and US features to predict PFS in patients diagnosed with OC.
The variables identified by the RFE as the attribute core set to predict PFS have already been studied in the literature.
In our cohort, age and menopausal status were negatively associated with PFS (Table 3). Consistently, Okunade et al. reported that age ≤ 55 years was an independent predictor of improved PFS [53]. In the study of Trifanescu et al., in premenopausal women, PFS was significantly higher than in post-menopausal ones [54].
In clinical practice, residual tumor is regarded as the most important factor for PFS [53]. Patients with no residual tumor after primary debulking surgery or interval debulking surgery have increased PFS and OS rates compared with patients with residual tumor [55]. However, in our study, residual tumor was not identified by ML as a predictor of prognosis. Of note, when a LR model was run for inferential purposes, residual tumor was found to be associated with PFS (OR 3.04, 95% CI 1.62-4.46, data not shown). Additionally, residual tumor was strongly correlated with high FIGO stage in our cohort (Cramer's V = 0.91, data not shown). In this regard, when building an XGBoost-based RFE wrapper, it should be noted that such multicollinearity is handled automatically and the algorithm keeps only one of the correlated attributes for splitting trees [56]. This might explain why residual tumor was not included in the attribute core set.
The main limitation of our study is the small sample size, which is a fundamental consideration in ML research. Nevertheless, RFF has proven robust in previous studies with small or similar sample sizes [23]. To be adopted in clinical practice, the algorithm will need extensive external validation on larger prospective cohorts.
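The strength of association between two categorical attributes, such as residual tumor and FIGO stage, can be quantified with Cramér's V, computed from the chi-squared statistic of their contingency table. The sketch below shows the calculation without continuity correction and assumes that no row or column total is zero; the function name `cramers_v` and the 2 × 2 table are hypothetical and do not reproduce the cohort data.

```python
import math

def cramers_v(table):
    """Cramér's V for an r x c contingency table given as a list of rows."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_tot), len(col_tot)) - 1
    return math.sqrt(chi2 / (n * k))

# Hypothetical table: rows = residual tumor (no/yes), columns = FIGO stage (low/high).
table = [[30, 2], [1, 31]]
v = cramers_v(table)
```

Values close to 1 indicate that one attribute is almost fully determined by the other, which is the situation in which a tree-based selector has little reason to retain both.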
In gynecologic oncology, ML is a step toward precision medicine, leading to an improved patient profile and personalized treatment.
This model could be applied at the time of diagnosis to predict 12 month PFS in patients with OC. Ultrasound is a simple, non-invasive and inexpensive examination. A ML approach applied to gynecological ultrasound could allow follow-up to be personalized by stratifying patients according to the predicted PFS, intensifying the prescription of instrumental examinations in high-risk patients and reducing them in low-risk patients.
This algorithm requires few easy-to-collect attributes. Further studies are needed to assess the potential of ML algorithms in routine gynecologic care.

Author contributions
FA and VL performed the study conception. GC, CL and DLaF contributed to the study design. Material preparation and data collection were performed by CMS and MM. The first draft of the manuscript was written by FA and GC. CL, ES and DLaF performed the data visualization. The manuscript was reviewed by FA, MM and CMS under the supervision of GC. The project was administrated by VL, EC and FA. All authors read and approved the final manuscript.

Funding
Open access funding provided by Università degli Studi di Bari Aldo Moro within the CRUI-CARE Agreement.

Data availability
Data are not freely available due to local Ethics Committee privacy issues. Authors will consider data sharing upon specific request to local Ethics Committee.

Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.

Ethical approval
The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Ethics Committee of Azienda Ospedaliera Policlinico Consorziale-University of Bari, IT (protocol code 6398, date of approval 10.06.2020).

Consent to participate
Informed consent was obtained from all subjects involved in the study at baseline consultation.

Consent to publication
The authors affirm that human research participants provided informed consent for publication of the images in Figs. 2, 3, 4, 5, 6.

Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.