1 Introduction

Gynecologic (GYN) neoplasms (2022 ICD-10-CM code: C51-C58) occur in the female genital tract (FGT) and are referred to as tumor sites: vulva (C51), vagina (C52), cervix uteri (C53), corpus uteri (C54), “uterus, part unspecified” (C55), ovaries (C56), “FGT, other/unspecified” (C57), and placenta (C58) (dos Santos et al. 2023). Notwithstanding the mutual system, each depicts distinct features that, in turn, lead to long lists of differential diagnoses and prognoses (Malla et al. 2021; Hsiao et al. 2021). Pathogenesis involves a variety of intrinsic and extrinsic risk factors (RFs) above and beyond inherited faulty genes; for instance, (i) endometrial and ovarian carcinogenesis is attributed to older age (Momenimovahed et al. 2019; Zhao et al. 2021), high adult body mass index (BMI) (Cramer 2012), and/or hormone-replacement therapy exposure (Liang et al. 2022; Liu et al. 2019; ii) cervical carcinogenesis is strongly attributed to high-risk human papillomavirus (HR-HPV) infection and low socioeconomic status (Hull et al. 2020). Such RFs are inherently tied to behavioral, sociodemographic, and environmental factors, which emulates the unequal burden distribution of gynecological malignancies worldwide.

Endometrial (EC) and ovarian cancer (OC) epidemiology is nearly entwined, with a greater prevalence in high-income countries (HICs), notably in Northern and Western Europe, North America, and industrialized populations (Cramer 2012; Zhang et al. 2019; Nees et al. 2022). Under the International Agency for Research on Cancer (IARC) GLOBOCAN series, Canada and the United States of America carry the heaviest endometrial cancer burden (Brüggmann et al. 2020). By 2040, EC global incidence rates are set to leap over 50% from 2018 (Zhang et al. 2019). Ovarian cancer gauges for fewer cases (Cramer 2012), yet it is three times more lethal than breast cancer (Momenimovahed et al. 2019) as OC burden often fences in its complications, comorbidities, and treatment. The same report indicated that, by 2040, OC incidence and mortality rates are set to increase by over 36% and 50%, respectively, from 2020 (Chardin and Leary 2021; everhobbes n.d.).

Cervical cancer (CxCa) is a serious global health concern in low- and middle-income countries (LMICs), wherein 85% of cases and 87% of deaths occur (Rahman et al. 2019). Women with CxCa usually suffer from more complex and acute conditions than other severe illnesses (Krakauer et al. 2021), engendering 740 deaths per day (LaVigne et al. 2017). Again, the emerging World Health Organization (WHO) reports affirm alarming clues of an estimated 442,926 deaths by 2030 (Rahman et al. 2019), over 95% are expected in LMICs (LaVigne et al. 2017)—an upsurge despite the availability of HPV vaccination and cervical screening and management (CSM) programs (Gravitt et al. 2021). To overcome current discrepancies, WHO General Director released a global call to action for CxCa elimination as a public health issue in 2018 (Wilailak et al. 2021). In August 2020, the 73rd World Health Assembly (WHA) endorsed it (Krakauer et al. 2021).

Albeit the surfeit of GYN cancer data, the true burden remains unascertained (Medhin et al. 2020). Documenting the global burden is afflicted by the scarcity of local and national cancer registries in resource-limited settings (LaVigne et al. 2017; Varughese and Richman 2010). Commonly, underdiagnosed or underreported cases in semi-urban or rural areas are substantial impediments to proper assessment for international agency resource allocation (Varughese and Richman 2010; Fiscutean 2021). In 2021, Fiscutean (2021) report raised serious concerns regarding the gap between reality and official OC rates. Although ovarian cancer appears fairly rare in Africa, experts reckon that official records skip vast swathes of individuals and that women, no doubt, die without a diagnosis (Fiscutean 2021). Further, increasing westernization in lifestyles and behaviors is compelling emerging nations to carry the brunt of a heavier burden (Li et al. 2022). Conforming to IARC, OC alone is expected to rise over 87% in Africa by 2040—an eminent soar than anywhere else in the world (Fiscutean 2021). Whether the burden is truly proportionate to the actual overall incidence or a reporting bias, GYN oncology has evolved into a multifaceted global health challenge needing enhanced and scalable analytical approaches throughout the “precision medicine” era. Herein, data mining is introduced.

Data mining (DM), a sub-process of knowledge discovery in databases (KDD), is capturing actionable ‘hidden patterns’ from huge repositories (Idri et al. 2018a). It is perceived as a model-building process that incorporates an amalgamation of techniques and, thus, ‘robust’ models drive domain grasp and decision-making (Duque et al. 2023). Such techniques could be broadly categorized as either: (i) non-machine learning techniques, predicated on classical statistical analysis such as linear regression, principal component analysis, or discriminant analysis; and (ii) machine learning (ML) techniques, elemental of artificial intelligence (AI) in the form of, for instance, a neural network, case-based reasoning, or decision tree (Idri et al. 2018b). Data mining reckon on the aforesaid techniques, but the pursued purpose is distinctive: predictive (supervised learning, “labeled” data) or descriptive (unsupervised learning, “unlabeled” data) (Idri et al. 2018a).

Predictive analytics uses current or historical circumstances to forecast explicit values from data of interest (Razzak et al. 2019). Its implementation yields credence to tasks such as classification and regression (Idri et al. 2018a). Conversely, descriptive analytics looks for patterns and relationships in a given set of data to get a meaningful understanding (Razzak et al. 2019). Unlike predictive counterparts, descriptive models are investigative and do not attempt to predict new properties (Razzak et al. 2019). Clustering and association rules are thus designated under descriptive mining (Idri et al. 2018a).

All analytics are paramount to oncology, revolutionizing every facet of clinical practice (Luchini et al. 2021). As the health sector strives toward automation, increasingly sophisticated cancer-related knowledge could empower intelligent systems and assuredly solidify oncology and DM as viable partners in the near future. By dint of extensive heterogeneity, FGT tumors are an optimum realm for DM breakthroughs, yet little progress has been achieved. One reason for this could be unsatisfactory quality or reporting of the actual evidence, limiting its clinical applicability. The scope of implemented DM-ML in GYN oncology remains imprecise or whether advanced computational techniques enhance scientific understanding, as is the extent to which it goes beyond “numeric”.

Building on a prior mapping study (Idlahcen and Idri 2022), the present systematic review aims to assess the depth and breadth of DM literature in GYN malignancies during the past decade, with an emphasis on ML-based approaches. Matter of fact, systematic maps (SMSs) serve a key role in narrowing down a broader topic across perusals pertinent to valuable questions (Idlahcen and Idri 2022). Likewise, SMSs tend to be well-suited for outlining research gaps, depicting research trends, or bracing an exhaustive systematic literature review (SLR) (Idlahcen and Idri 2022). Notwithstanding analogous characteristics, SMSs and SLRs are distinguishable in goals and analysis schemes (Idlahcen and Idri 2022). While an SLR aims to achieve an evidence synthesis, an SMS is particularly concerned primarily with visually structuring literature into tabular and graphical summaries (Idlahcen and Idri 2022). With this in mind, the strength of the present SLR is two-fold:

  • Synthesize evidence of DM-ML literature relevant to GYN oncology by searching extensively the research published between January 1, 2011, and February 28, 2022, in five digital resources: PubMed, IEEE Xplore, ScienceDirect, Springer Link, and Google Scholar.

  • Analyze the selected articles through eight aspects:

  1. 1.

    Publication trends, venues, and sources where the selected articles have been issued.

  2. 2.

    Discerning the commonest tackled GYN sites-derived neoplasm with DM-ML.

  3. 3.

    Classifying the selected articles according to contributions and empirical schemes.

  4. 4.

    Determining the most popular data modalities and openly accessible repositories relevant to GYN oncology.

  5. 5.

    Discerning the most encountered GYN oncology tasks by DM-ML.

  6. 6.

    Determining the most pursued DM-ML aim in the literature.

  7. 7.

    Highlighting the most adopted DM-ML techniques in the literature.

  8. 8.

    Determining the most adopted performance framework to assess DM-ML models.

We anticipate the present systematic review would disclose literature gaps and ease foster the knowledge and awareness of DM-related GYN oncology practices among oncologists and data practitioners.

The remaining document is outlined as follows. Section 2 describes the research methodology, and how articles were selected and assessed. Next, the findings of the RQs are delineated in Sect. 3 and discussed in Sect. 5. Then, Sect. 6 suggests recommendations about what the audience should address based on the research findings. Section 7 sums up the present SLR.

2 Methods

A systematic literature review (SLR) is a structured, transparent, and exhaustive scientific tool to get the “lay of the land” of a topic (Greyson et al. 2019; Martinic et al. 2019). It is, therefore, a well-defined and protracted protocol that requires expertise to be conducted (van Haastrecht et al. 2021). SLR policy traces back its genesis to medicine (or evidence-based practice) (van Haastrecht et al. 2021). Since then, it has expanded across all disciplines, actively computer science subfields such as software engineering (Kitchenham et al. 2004), artificial intelligence (Sarana and Subhashini 2023), and medical (health) informatics (Hosni et al. 2019; Wadghiri et al. 2022; Melton 2017). The first Evidence-Based Software Engineering (EBSE) paradigm was proposed by Kitchenham (2004) Kitchenham (n.d)., then, through embracing (Brereton et al. 2007) strategy (2007), Kitchenham and Charters (2007) supplied clear guidelines (Keele et al. 2007) for systematic reviews’ conduct.

The following sections describe the methodology used throughout this paper.

2.1 Study design and protocol registration

The current study was undertaken as an SLR in compliance with Kitchenham and Charter’s guidelines and the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (Page et al. 2021). The PRISMA Protocols (PRISMA-P) checklist (Moher et al. 2015). The study protocol has been registered with PROSPERO (Schiavo 2019), the international registry for systematic reviews. No ethical approval or informed consent was mandated.

2.2 Research questions

The goal of the present SLR could be explicitly translated into research questions (RQs) as follows:

  • RQ1. In which years, publication venues, and sources were the selected articles published?

  • RQ2. What is the most tackled GYN site-derived neoplasm using DM-ML?

  • RQ3. In which contribution and empirical schemes are the selected articles categorized?

  • (RQ4.1) What are the prevailing data modalities to investigate DM-ML as an adjunct to GYN oncology? (RQ4.2) Which open datasets are commonly used to investigate DM-ML as an adjunct to GYN oncology?

  • RQ5. Which medical tasks of GYN oncology practice are addressed with DM-ML?

  • RQ6. What DM-ML objectives are fulfilled for GYN oncology research?

  • RQ7. Which DM-ML techniques are commonly implemented in GYN oncology settings?

  • (RQ8.1) Which performance metrics are prevalent to evaluate DM-ML techniques as an adjunct to GYN oncology? (RQ8.2) Which validation techniques are prevalent to evaluate DM-ML techniques as an adjunct to GYN oncology? (RQ8.3) Clinicians vs. DM-ML? (RQ8.4) How perform DM-ML in GYN oncology (A case of cervix image classification for screening)?

Through answering our RQs, we seek to supply a systematic overview of the in-depth and meticulous analysis of DM-ML implications in GYN oncology.

2.3 Search strategy

Five digital libraries: (i) PubMed, (ii) IEEE Xplore, (iii) ScienceDirect, (iv) Springer Link, and (v) Google Scholar, were searched for peer-reviewed and English-language articles published from January 2011 until February 2022. We crafted the search query guided by two schemes: (i) PICO (Population, Intervention, Comparison, and Outcomes) synthesis recommended by Kitchenham and Charters and (ii) extraction of key terms/concepts from the RQs. Since no empirical assessments and tangible outcomes were conducted in the present SLR, only the initial two letters of PICO acronym and analogs were gleaned. The full scheme is supplied in Supplementary Material 1.

2.4 Study selection

An article of interest is verified as a primary study if it meets all the inclusion criteria and none of the exclusion criteria. The following inclusion (I) and exclusion (E) criteria for eligibility were used:

  • IC1. The article regards novel, or improving current, DM-ML/ GYN oncology.

  • IC2: The article scrutinizes DM-ML/ GYN oncology.

  • IC3. The article regards empirical/theoretical scrutiny of DM-ML/ GYN oncology.

  • IC4. The article pertains to [January 1st, 2011–February 28, 2022].

  • EC1. The article covers other kinds of cancer besides FGT sites.

  • EC2. The article covers polycystic ovary syndrome (PCOS) or endometriosis.

  • EC3. The article regards primarily novel, or improving current, data pre-processing approach.

  • EC4. The article regards radiomics, pathomics, or radiopathomics material.

  • EC5. The article is not presented in English.

  • EC6. The article is a duplicate.

  • EC7. The article is no longer than 6 pages (incl. references, well-marked appendices).

  • EC8. Extended abstracts, posters, or editorials.

  • EC9. The article is a non-peer-reviewed material.

  • EC10. The full-text is unavailable.

Subsequent to discarding duplicate entries through EndNote reference management software, articles were carefully screened to reject or retain the paper based on metadata, i.e. titles, abstracts, and keywords; or otherwise, the full text if metadata is unsatisfactory. The evaluation step was carried out independently by the two authors (AI and FI).

2.5 Quality assessment

Each article was retained upon four QAs: (i) QA1 assesses the clarity of the empirical results provided in the study by means of the research questions, (ii) QA2 assesses both the clarity of the empirical design and the systems set of the experiment, (iii) QA3 assess the adoption of performance measures to evaluate the study outcomes, and (iv) QA4 scoring attests the rank of the article: (i) for conferences, the ranking is based on the Computing Research and Education Association of Australasia (CORE Conference Ranking Exercise 2021), and (ii) for journals, the ranking is based on the Web of Science Journal Citation Reports (JCR 2020). We regarded article quality scoring as an exclusion criterion. Note that a similar strategy was adopted by Idri et al. (2015). (Supplementary Material 2: Tables S2A, S2B, and S2C) provide the QA criteria checklist and a full list of the selected articles and venues, including corresponding QA scores.

2.6 Data extraction

At this point, each eligible article was subjected to a data extraction framework to address the aforementioned RQs. The following are the joint fields.

  • Publication Year: calendar year.

  • Publication Venue Petersen et al. (2015): journals, conference proceedings (or book chapters).

  • Source Name: official title of the publication venue.

  • GYN neoplasm(s), in which DM-ML techniques have been performed, was/were identified for each study. The neoplasm(s) site could, for example, be ovarian, endometrial, cervical, vaginal, vulvar, etc.

  • Contribution Types (Wieringa et al. 2005): (i) Evaluation research (ER), a paper evaluating empirically a given DM-ML technique (new or existing). (ii) Solution proposal (SP), a paper proposing a DM-ML solution approach. It must therefore be novel, or markedly improve an existing approach. (iii) Validation research (VR), a paper exploring a DM-ML technique in clinical practice. (iv) review, a paper assessing the literature of DM-ML as an adjunct to GYN oncology. (v) Others, e.g. experience reports, philosophical papers, opinion papers.

  • Empirical Methods (Petersen et al. 2015): (i) Case study (CS), an empirical evaluation carried out in a real-life medical setting. (ii) Historical-based evaluation (HBE), a paper assessing the performance of a specified DM-ML technique throughout existing datasets. (iii) Survey (S), a paper based on a dataset garnered through a questionnaire.

  • Data modalities could, for example, be a whole-slide image (WSI), miRNAs-mRNAs expression profiles, GC-MS serum metabolomic fingerprint, hysteroscopy captures, mid-infrared (MIR) spectra of biofluids, cervicograms, DCE-MRI pharmacokinetic parameters, etc.

  • Publicly available dataset name.

  • Six core medical tasks (Esfandiari et al. 2014) (i) Screening (Sc), detecting illness or body dysfunction prior to as-yet-unrecognized symptoms or markers. (ii) Diagnosis (Dx), recognizing the character of a disease from its symptoms and markers. (iii) Treatment (Tx), undertaking measures to cure a patient; diagnosis outcomes serve to determine a suitable treatment. (iv) Prognosis (Px), predicting the likelihood of a disease’s course or outcome, esp. recovery, recurrence, survival, and quality of life (QoL). (v) Monitoring (M), surveilling a patient’s condition throughout time. (vi) Management (Mx), promoting health and medical services.

  • Four DM-ML objectives are considered. (i) Classification, categorizing into classes a given set of data; the output is categorical, e.g. ’case’, ’control’. (ii) Regression, predicting continuous or real values from a given set of data; the output is numerical, e.g. BMI, temperature. (iii) Clustering, disclosing inherent groupings in a given set of data, e.g. grouping patients by health perception. (iv) Association rule mining (ARM), identifying frequent patterns, co-occurrences, and correlations in a database, e.g. the relationship between LMICs and cervical cancer.

  • DM-ML techniques could, for example, be deep neural networks (dNNs), support vector machine (SVM), decision trees (DTs), k-nearest neighbors (k-NN), etc.

  • Performance metrics could, for example, be an accuracy (Acc), sensitivity (Sen), specificity (Spe), F-measure, area under a receiver operating characteristic curve (AUROC), mean absolute error (MAE), etc.

  • Validation techniques could, for example, be k-fold cross-validation (Kf-CV), leave-one-subject (patient)-out (L1SO-CV), Monte Carlo (MC-CV), Hold-out, etc.

  • The performance of the selected sample techniques.

2.7 Threats to validity

The core threats to the present SLR validity are narrated in the following two aspects.

  1. 1.

    Study selection bias. To capture the maximum evidence and avoid any identification bias(es), we constructed a search strategy at two levels. First, gather pertinent keywords from the research questions under PICO framework. Second, exert iterative testing in which we examined the effect of alternate terms, stemming/wildcards, Boolean search conventions, and the use of assigned terms (MeSH terms) for a broader and refined search string. This resultant was performed across five digital libraries, PubMed, IEEE Xplore, ScienceDirect, Springer Link, and Google Scholar, with adjustments as necessary according to each database. Given the cross-disciplinary nature of the topic, it was prudent to utilize multifaceted libraries to access a broader range of pertinent papers, encompassing both “computer science” and “biomedicine” literature. The Google Scholar search engine functioned as a “secondary source” to ensure we rarely miss out on an article.

  2. 2.

    Internal validity, or potential for bias(es), is a particularly important feature of systematic reviews where the scope to which the conduct achieves trustworthy results. It deals with data extraction and analysis (Idri et al. 2016). To lessen the threat of imprecise or erroneous “data extraction”, the form was completed independently by the two authors. Disagreements or doubts were settled by in-depth team discussions (AI and FI) to establish inter-rater reliability.

3 Results

In this systematic review of all DM-ML/ GYN oncology, we found 3499 candidate articles. 181 of which fulfilled eligibility criteria, were in-depth analyzed to provide valuable insights.

3.1 Study selection and characteristics: an overview

Fig. 1
figure 1

PRISMA (preferred reporting items for systematic reviews and meta-analyses) flow diagram of the literature selection scheme

The total articles retrieved at each stage of the study selection is depicted in Fig. 1. In a prior systematic map (Idlahcen and Idri 2022), a total of 2807 primary studies published between January 2011 and June 15, 2021, were identified by searching five digital libraries as granted in Sect. 2.3. The updated search yielded 692 extra candidate articles published between June 16, 2021, and February 2022, which were appended to the earlier number, amounting to 3499 primary studies. 3142 articles were discarded following the ICs/ECs, retaining 357 articles emphasizing the application of DM-ML in GYN oncology. At last, we used the quality assessment on the 357 relevant articles. Due to low-quality scores, the process resulted in the discard of 176 articles. While 50,7% (181 art.) of the selected articles had a higher average score as per Table 1, 49.3% (176 art.) were below and thus subtracted. Ultimately, 181 articles were used.

The included articles’ data extraction is summarized in full in (Supplementary Material 3: Table S3).

Table 1 Quality scores and levels of the selected articles

3.2 RQ1. In which years, publication venues, and sources were the selected articles published?

Fig. 2
figure 2

Year’s distribution of the selected articles per FGT site. Ca. Cancer

The majority is published in journals (93.92%, 170 art.). Optimal ones (Table 2) to publish on DM-GYN oncology cope with disciplines as follows: (i) computer science -AI, (ii) obstetrics (OB)/GYN, or (iii) biomedical (clinical or health) informatics. By virtue of our QA, not CORE-ranked conference proceedings were excluded, leading to only (6.07%, 11 art.) being captured in the selection study. Accordingly, journal analysis revealed an interdisciplinary nature within the field under consideration. Figure 2 has more information on the distribution of the selected articles. We discern notable fluctuations in the yearly publication count. From 2011 to 2014, it is subjected to a gradual increase from one to ten published articles respectively, followed by a drop to eight articles each in 2015 and 2016, a progressive emerges over 2017–2018 (11–13 art.), then a sharp spike onward 2019, reaching 66.85% (121 art.) of the selected articles by 2022. While cervical neoplasm research contributed significantly to the 2019 soar by 60.53%, the other kinds remained marginally steady.

Table 2 Publication venues/sources of the selected articles

3.3 RQ2. What is the most tackled GYN site-derived neoplasm using DM-ML?

Table 3 FGT neoplasms

GYN tumors are a heterogeneous collection of masses arising from the female reproductive organ(s)—featuring five main types: cervical, ovarian, endometrial, vaginal, and vulvar (Lõhmussaar et al. 2020). After data extraction, we encountered several distinct types, incl. uncommon forms. To keep the report concise and easy for users to interpret, we elected to shorten GYN masses into six classes from both FIGO systems (Kehoe and Bhatla 2021) and “International Classification of Diseases (ICD)” standpoint as follows: (i) cervix uteri; (ii) ovary, Fallopian tube, and peritoneum; (iii) uterus; (iv) vagina; (v) vulva; and (vi) placenta; as shown in Table 3. Recall that a single article could examine multiple GYN neoplasms.

We witness a high focus for “cervical” neoplasms (57.6%, 118 art.), followed by “ovarian, tubal, and peritoneal” ones (24.9%, 51 art.), then “uterine” (15.1%, 31 art.) class. At last, “vaginal”, “vulvar”, and “placental” represents 1%, 0.5%, and 1%, respectively. Regarding the “ovary, Fallopian tube, or peritoneum” site, ovarian tumors garnered the most attention with 40 articles, followed by adnexal (6 art.), tubal (3 art.), and peritoneal (2 art.). While sarcoma and leiomyoma each account for five and four per uterine articles, endometrial tumors reckon 20 articles, alongside two additional art.: (i) one arising from the round ligament, and (ii) an unstated fibroid type. Such constitutes respective 64.51%, 16.13%, 12.9%, 3.23%, and 3.23% of the uterine site. Within placental tumors, throphoblastoma and choriocarcinoma are both covered in a single article.

3.4 RQ3. In which contribution and empirical schemes are the selected articles categorized?

Fig. 3
figure 3

Contribution types’ distribution of the selected articles per year

Wieringa et al. (2005) classify research/contribution types into six broad categories: (i) evaluation research (ER), (ii) solution proposal (SP), (iii) validation research (VR), (iv) philosophical papers, (v) experience reports, and (vi) opinion papers. 84.2% (155 art.) of the selected articles were SPs either providing a new approach or improving an existing one; of which all were empirically evaluated. A further 13% (24 art.) were classified solitary as ERs assessing the performance of current or novel DM-ML. Since extensive research is labeled similarly to solution and evaluation proposals, the number of ER articles steadily exceeded that of SP papers—particularly onwards 2012. Of note, five articles (2.7%) were VR studies.

Figure 3 displays the distribution of contribution types by year. The topmost are ERs that continued to rise with the fastest expansion from 2016 to 2020 (108 art.), thus the least declines were over 2014–2016 (10 to 8 art.) and 2020–2021 (45 to 39 art.). As only articles up to February 2022 were gathered, it is unsatisfactory to induce a statement pertaining to 2022. SPs had an akin tendency, on average, up to 2017 and beyond 2018 as a little dip was noticed during 2017–2018 when compared to ERs variations. The bottommost is for VRs accounting for one article optimum, yet distribution has not altered substantially in a decade.

Three empirical types could be outlined: (i) historical-based evaluation (HBE), (ii) case study (CS), and (iii) survey. Of the empirically evaluated articles (Table 4): (i) 36.72% conducted an HBE through publicly available datasets. Such type was used in 51.77%, 47.52%, and 0.71% of ER, SP, and VR papers, respectively; (ii) 61.72% adopted a CS as it was conducted in its medical real-life context, 52.32%, 45.99%, and 1.69% amongst were respective ER, SP, and VR studies; and (iii) surveys were employed in 1.56% of the articles, of which two SPs, three ERs, and one VR.

Table 4 Distribution of the selected articles per contribution and empirical types

3.5 RQ4.1. What are the prevailing data modalities to investigate ML as an adjunct to GYN oncology?

Table 5 Medical modalities used in the selected articles explained
Table 6 Sequential order of common tests and referred modalities for cervical cancer screening/diagnosis

Table 5 provides insights on the investigated medical data modality(ies). Recall that medical imaging encompasses most modalities (58.3%, 130 art.). It is also worth noting that 85.08% of the selected articles regarded follow-up biopsy as ground truth (Table 6).

3.6 RQ4.2. Which open datasets are commonly used to investigate ML as an adjunct to GYN oncology?

Table 7 denotes the publicly available datasets.

Table 7 Publicly available datasets

3.7 RQ5. Which medical tasks of GYN oncology practice are addressed with DM-ML?

Fig. 4
figure 4

Distribution of the selected articles per medical task

Figure 4 Illustrates the examined medical tasks in the present selection study. No “Management” medical task was undertaken by any of the selected articles. It should be noted that a single study could examine multiple medical tasks.

3.8 RQ6. What DM-ML objectives are fulfilled for GYN oncology research?

Fig. 5
figure 5

Distribution of the selected articles per DM-ML objective

Figure 5 displays the frequency of DM-ML objectives in the present selection study. Recall that more than one objective could be examined in a single study.

3.9 RQ7. Which DM-ML techniques are commonly implemented in GYN oncology settings?

Fig. 6
figure 6

Distribution of the selected articles per DM-ML techniques. NN, neural networks; SVM, support vector machine; k-NN, k-nearest neighbor; RF, random forest; LR, logistic regression; DT, decision tree; NB, naïve bayes; LDA, linear discriminant analysis

Figure 6 depicts the proportion of DM-ML techniques adopted per objective. Each technique has its strengths and limitations as summarized in Table 8. Figuring in 127 of the selected articles, neural networks (NNs) were the most prevalent DM-ML approach. Of these, 116 and 11 art. were committed to classification and regression, respectively. Solely to classification, NNs were followed by support vector machine (SVM), random forest (RF), k- nearest neighbors (k-NN), logistic regression (LR), and decision trees (DT), respective accounting for 79, 37, 37, 36, and 17 art. The rest of the classification techniques englobe LDA, naïve bayes (NB), adaboost, XGBoost (respective 14, 15, 7, and 7 art.), and others such as QDA, PLS-DA, BN, GBM, E-net, GCN, GNN, ELM to a lesser extent. NNs, comprise three types mainly: (i) standard artificial neural network (aNN), convolutional neural network (CNN), and (iii) recurrent neural network (RNN). CNNs gathered the most attention with 66 art. starting from AlexNet to Densenet-121. In regression, NNs were followed by LR, RF, SVM, DT, and XGBoost, accounting for 9, 8, 3, 2, and 2 of the selected articles, respectively. The rest of the regression techniques, such as GBM and CART, were less examined at 3% (8 art.) for each. Besides, k-means was the most prominent DM-ML clustering approach, with 12 articles implementing it. The remaining clustering techniques included: fuzzy c-means (FCM), partitioning around medoids algorithm (PAM), and hierarchical methods, e.g. hierarchical cluster analysis (HCA). Yet the ensemble algorithm for clustering cancer data (EACCD), which was designed in 2009 by D. Chen et al. to enhance the TNM staging system without altering its fundamental concepts, was observed in both Grimley et al. (2021) and Praiss et al. (2020) studies. Association-related articles employed four DM-ML techniques: fuzzy association rules mining (FARM), Apriori, and constrained-rule learning algorithm.

Table 8 DM-ML techniques frequently used in the selected articles explained

3.10 RQ8.1: Which performance metrics are prevalent to evaluate ML techniques as an adjunct to GYN oncology?

Fig. 7
figure 7

Distribution of the selected articles per performance metrics. TPR, true positive rate; TNR, true negative rate; PPV, positive predictive (or prognostic) value; AUROC, Area Under the Receiver Operating Characteristics; PRAUC, area under the precision-recall curve; AUC, area under curve

Figure 7 presents the used benchmark metrics to evaluate the selected articles’ models. Such are classified into (i) single scalar, e.g. accuracy, sensitivity, specificity, etc., or (ii) graphical, e.g. AUROC, AUPRC, etc. The five commonly used metrics were: (i) accuracy (Acc) (125 art., 17.7%); (i) sensitivity (Sen), recall (Re), or true-positive rate (TPR) (136 art., 19.2%); (iii) specificity (Spe) or true negative rate (TNR) (101 art., 14.3%); (iv) area under the receiver operating characteristics (AUROC) (78 art., 11%); (v) precision (Pr) or positive predictive (prognostic) value (PPV) (71 art., 10%); and (vi) F-measure (39 art., 5.5%). Nevertheless, each article tends to use a variety of performance metrics and the common combination was ‘accuracy, sensitivity, specificity, and AUROC’.

3.11 RQ8.2: Which validation techniques are prevalent to evaluate ML techniques as an adjunct to GYN oncology?

Fig. 8
figure 8

Distribution of the selected articles per validation schemes

Validation schemes are used to legitimize the performance of a DM- ML model fairly and accurately. Figure 8 depicts the validation strategies reported in the present selection study, namely, (i) cross-validation (CV), (ii) hold-out or data split, and (iii) bootstrapping, accounting for 62.43% (123 art.), 28.42% (56 art.), and 2.03% (4 art.) of the selected articles, respectively. Variants such as bootstrapped Latin partition (BLP) and Kennard-Stone (K-S) algorithm were each reported in one study. In 6.1% of cases, the validation process was unstated. Recall that a single article could undergo multiple validation techniques.

As per CV, five types were used, (i) Monte Carlo (MC-CV) in one study, (ii) k-fold (k-fCV) in 93 studies, (iii) stratified k-fCV in two studies, (iv) leave-one-subject(patient)-out (L1SO-CV) in 3 studies, and (v) leave-one-out (LOO-CV) in 14 studies. The ‘k’ value was either 5 (40.7%) or 10 (49.5%) to a greater extent, instead observed were the values of 2, 3, 6, 7, 9, and 15. Of note, the CV type was unstated in 10 studies.

3.12 RQ8.3: Clinicians vs. DM-ML?

17 out of 181 articles compared the performance of the implemented DM-ML models to that of clinicians. The assessments predominantly engaged board-certified seniors with professional experience spanning from 5 to over 25 years in the domain of interest or at different professional levels (interns, in-service, and professionals). The participants, who were blinded to the pathological and clinical findings of the study, frequently conducted independent reviews of the test sets in a randomized order. The commonly used metrics were Acc, Sen, Spe, AUC, and/or interobserver agreement.

3.13 RQ8.4: How do perform DM-ML in GYN oncology (A case of cervix images classification for screening)?

Table 9 Estimation of performance values of the 10-articles’ models

Due to the selection study’s heterogeneity, we carried out an overall evaluation of the performance of DM-ML techniques used in 10 selected studies which presented 260 empirical evaluations. These selected articles are (i) ranking a quality score of 5, and (ii) relevant to cervical cancer, classification, screening, and all associated imaging, i.e. liquid-based cytology, pap smear, and cervicograms. We selected the commonly used performance metrics, i.e. accuracy, sensitivity, specificity, f-measure, and AUC, to assess the reviewed techniques as reported in Sect. 3.10. The reported values were retrieved for the selected measures to synthesize all of the evaluations of each technique. An evaluation equates to using a DM-ML technique on a dataset, which means that an article on distinct datasets provides multiple evaluations. (Supplementary Material 4: Table S4) reports the performance metrics of the 10 selected studies and Table  9 as summary. Although not representative, we observe as follows:

  • CNNs was the most assessed with 49 evaluations (9 art.).

  • The highest accuracy values were attained by aNNs, CNNs, and SVM, with respective means of 93.22%, 92.65%, and 88.73%, and low related standard deviation values.

  • Bns(C) and k-NN also achieved good accuracy values, with respective means of 83.63% and 85.23% for the same number of evaluations, 44 each. Still, high related standard deviation suggests that the values are dispersed across a greater range, raising questions about their performance.

  • aNNs reached high sensitivity and specificity values also, indicating their strength in test diagnostics.

  • CNNs may tend to falsely find a result even though they attained high specificity values due to the high associated standard deviation.

4 Discussion

4.1 DM: Changing the landscape of GYN oncology

The publications count soared markedly in 2019. The overall trend is consistent with three hypotheses: (i) deployment of digital pathology (Hanna et al. 2020), “omics” (Adamo et al. 2018), and datafication in clinical and non-clinical settings (Manteghinejad and Javanmard 2021), (ii) “precision oncology” era opened up novel pathways to urge AI-DM integration, and (iii) 2018 WHO DG global call-to-action to eliminate cervical cancer as a public health issue (Wilailak et al. 2021). Prior to 2017, no regulatory approval was issued for whole-slide scanners in primary surgical pathology, resulting in virtual microscope glass slides standing in stark contrast to digitizing radiology since the 1980s (Boyce 2017). Besides, large-scale pan-omics repositories, e.g. The Cancer Genome Atlas (TCGA), Gene Expression Omnibus (GEO), Clinical Proteomic Tumor Analysis Consortium (CPTAC), and so forth, are rapidly expanding to accommodate multi-omics with the improvement and reduced cost of mass spectrometry (MS) and high-throughput next-generation sequencing (NGS) technologies (Morganti et al. 2019; Girolamo et al. 2013). As data is a vital prerequisite, a such substantial upsurge gave impetus to the development of novel ML strategies that integrate multifaceted and pertinent datasets beyond questionnaires and classical radiology. Following 2018 WHO DG-plea, over 70 countries acted swiftly and positively toward “hidden” cervical cancer, which may partially explain the overall peak of publications in 2019—prominently on cervical cancer—as reflected in Fig. 2.

4.2 Aligning DM with GYN cancer disparities

Cervical cancer is a paradigm of global health disparity reckoning higher morbimortality rates than the combined uterine and ovarian malignancies (Medhin et al. 2020). Thus, it is apparent that “cervix uteri” was the most examined FGT site. But while the CxCa burden is readily recognized in LMICs, endometrial cancer represents a major global health disparity in developed countries (Medhin et al. 2020). Yet, “corpus uteri” endometrium was little examined. Such scrutiny entails more research as recent US evidence (Smrz et al. 2021) corroborates a perennial increase in EC incidence rates, pointedly young-onset age. Ovarian neoplasms were the second most examined, haply by cause of lethality associated with poor survival rates than those of uterine and cervical malignancies (Nie et al. 2021). Only a few investigated the less common forms of uterine and adnexal masses, but despite the lesser extent, it is still important for (i) a differential diagnosis between degenerated leiomyomas and uterine sarcomas in magnetic resonance imaging (MRI) (Suzuki et al. 2019) and (ii) case–control of adnexal masses in ultrasound images as the first-line imaging modality to satisfactorily patient triage (Holsbeke et al. 2012)—prospective research is therefore desirable. The remaining types, incl. (i) vaginal as reported by (Tian et al. 2019) and (Liu et al. 2021; ii) vulvar, round ligament, throphoblastoma, and choriocarcinoma as reported by Liu et al. (2021); (iii) tubal as reported by Liu et al. (2021), Laios et al. (2020), and Vázquez et al. (2018); and (iv) primary peritoneal as reported in Laios et al. (2020) and Vázquez et al. (2018); were incorporated alongside common types, which is insufficient to induce a statement relevant to such kinds.

We draw attention to a singularity, (i) (Kumar et al. 2014) examined the pre-/post- treatment quality of life (QoL) in CxCa patients, (ii) (Zhang and Han 2020) investigated an under-researched cancer patient population, which is pregnant or lactating women, and (iii) (Asaduzzaman et al. 2021) implicated patients suffering from ovarian or cervical cancer, with stress (Mental) disorder too. Since it is fundamental to analyze an individual as a “whole” not independently as “parts” or “organs”, it is constantly appealing to target a multimorbidity-based or a whole patient-based sample population to unravel pathogenesis insights. Besides, Kumar et al. (2014) and Asaduzzaman et al. (2021) pinpointed valuable yet neglected aspects in GYN tumors patients, i.e. mental disorders and quality of life after therapy. As it stands, extensive research is required to investigate the ability of ML-based prediction algorithms in terms of (i) capturing behavioral morbidity or subjective psychological response patterns across risk factors and (ii) forecasting the quality of life in the functional, symptom, and global health/QoL scales for finer treatment modules. Similarly, maternal tumors raise several concerns regarding the mother-fetus safety as well as the cause-specific survival of pregnant and lactating women with cancer or post-cancer, so it is of special interest to address the use of learning techniques to improve detection and care in obstetrical settings.

4.3 DM: a paucity of clinical validation

While medical data expansion paints an overall positive picture, existing DM-ML techniques still need to integrate with medical data conformations, which is congruous with the large portion of solution proposals scrutinized (84.2%). As advanced approaches are proposed to attain even better performance, expectedly, all amongst were empirically evaluated. However, we are particularly concerned about the paucity of validation research (5 art.). Since building and deploying DM models differ, it is crucial to cushion what might adversely strike the proposed model in real-world clinical settings. Such scarcity asserts that the implementation of DM techniques in GYN oncology is still limited to academic research, more research is needed to raise maturity.

It is appealing that most of the selected articles (97.28%) were empirically evaluated, with 60.03% referring to case studies, since it reflects researchers’ awareness of the value of cooperating with practitioners for successful trials and thereof generalizing across different cohorts with disparate parameters. The next 36.72% used historical-based data, accounting for 47.52% of solution proposals owing to both reliability and ease of open-access datasets use when evaluating SPs performance. Yet, the wide adoption of CS could also raise the question of the scarcity of availability and sharing of databases related to GYN Oncology. Barriers to biomedical data sharing remain major hindrances to availability, utilization, and inter-institutionalization and have been well-documented by van Panhuis et al. (2014). The resulting taxonomy concern: (i) technical barriers, including lack of data collection, no prioritization in data preservation or archiving, restrictive data formats, lack of metadata, standards, or interoperability (e.g. structure or language), along with a scarcity of analytical solutions to address such challenges although identified, hinder the potential for integrating several datasets, particularly in an international context. (ii) motivational barriers rooted in personal or institutional beliefs, including lack of incentives, opportunity costs, worry of criticisms or discredits, and disagreement on data secondary access/use. Solutions entail the establishment of trust or the formulation of transparent legal agreements. (iii) economic barriers concerning the feasibility and actual costs and damages related to data sharing and potential solutions hinge on acknowledging the value of data and implementing sustainable financing mechanisms. (iv) political barriers ingrained within health governance systems often manifest as issues related to trust, restrictive policies, and lack of guidelines. Addressing them is not a clear-cut endeavor, requiring national and global processes to foster agreement and political will. (v) legal barriers related to ownership, copyrights, and privacy protection. Resolving them relies on underlying political barriers. (vi) ethical barriers involving conflicts between moral principles and values and readily apparent in issues related to reciprocity and proportionality. Engaging in a global dialogue on the ethical principles and model legislation governing data share could resolve it. In-depth evidence is needed to broaden the understanding on sharing biomedical data. As knowledge of barriers deepens, the scope for solutions and efforts would enlarge for the benefit of global health.

Only 1.56% of the articles are based on surveys due to the strain of obtaining exhaustive and accurate information from questionnaires. Previous literature (Jo 2022) acquiesces three assumptions: (i) survey structure, (ii) narrow knowledge, and (iii) refusal to answer because of personal, sensitive, or uncomfortable concerns, notably related to GYN cancer survivors QoL.

4.4 Barriers to DM in GYN oncology

Medical imaging is a force to be reckoned with in modern medicine, it provides a DM foreground application, yet it is no wonder imaging was the most used with 58.3% of the selected articles. Yet, to join  Sect. 4.3, DM-ML is still restricted. The real hurdle encountered in medical data is the inability to use straightforward approaches without any type of pre-processing to disable what might have detrimental effects on the developed model.

As for image-based data, whole-slide imaging has (i) the highest noise, (ii) staining fluctuation, (iii) artifacts, (iv) a tendency to surpass the stacking capability of GPUs, and (v) an intra- and interobserver heterogeneity, which might yield poor perception in either manual or computer-assisted assessment. Hence, implementing WSIs is a decisive endeavor requiring, at a minimum, stain normalization, artifact correction, and patch-based analysis. As hematoxylin-eosin (H &E) and immunohistochemistry staining reagents and protocols vary considerably between laboratories and often fade over time (Downing et al. 2019), it is of urgent necessity to explore prospective multi-center cohorts with varied manufacturers’ chemicals, standards, magnifications, and further modern pathology RQs. Next, digital captures from cervicography and hysteroscopic video clips come with poor quality due to blur, incomplete cervix or uterus exposure, lesions coated by vaginal discharge, heavy bleeding, and so forth (Yuan et al. 2020), resulting in a loss of clinically relevant features and a pronounced class imbalance. Digital captures entail generalization as well over multistate scenarios—notably regarding the used contrast agents in colposcopy. However, only three articles (Yuan et al. 2020; Asiedu et al. 2019; Yu et al. 2021) examined D-VIA with either D-VILI, saline, or built-in green filters. Besides, most research neglects the sequential process of colposcopy-based precancerous lesion recognition by considering only a single cervicogram, which makes it difficult to reflect the acetowhite changes in compliance with the diagnostic customs of gynecologists. Herein, Li et al. (2020) and Peng et al. (2021) provide successful cases of cervical intraepithelial neoplasia (CIN) diagnosis from time-lapsed cervicograms based on respective (i) fusion approaches at distinct post-acetic test time slots, i.e. 60 s, 90 s, 120 s, 150 s; and (ii) multimodal feature changes of registered pre- and post- acetic acid test images. Recall that most of the articles of the current selection study converted the special medical image file formats to JPEG, which is not required when loss of the finest details is forbidden. Such might alter the diagnostic accuracy in contrast to learning directly from DICOM (Urushibara et al. 2021) or Aperio SVS.

As for feature-based data, missing values are inevitable—particularly in electronic records and questionnaires (see  Sect. 4.3). One straightforward approach to account for it is to remove the related instances, yielding usually to imbalanced and smaller datasets. As it stands, data must be pre-treated by either oversampling the minority class or predicting the missing values, which again is limited to biased findings. Besides, biomedical data has both multimodal and high-dimensional properties, involving common inconsistencies, redundancy, and noise in raw data. Such matter leads to the “curse of dimensionality’—for instance in transcriptome profiling, where miRNA expression microarray data contains massive genes with a little number of observations. It is therefore vital for a gene selection to minimize its size by selecting informative ones regardless of the DM techniques used.

Of note, it is worthwhile to highlight the efforts made by researchers to investigate multifaceted and cutting-edge medical data.

4.5 DM: going beyond diagnostics

“Diagnosis” underwent a higher level of scrutiny (42%), not unexpectedly with the intent of granting patients with ideal diagnostic scenarios: (i) inter- and intra-observer discrepancy-free, (ii) non- or minimally invasive, and (iii) early detected.

Visual examination of biopsied tissues is the “gold standard” for proper cancer diagnosis, yet pathological findings might differ strikingly even within a single specimen due to observer variability from human error (Genta 2014). To avert misdiagnoses, results are always subject to an expert second opinion as mandated by most pathology laboratories’ policies (Farooq et al. 2021). As the “tipping point” of digital pathology (DP), ML-based CADx(s) are able now to supply inter-rater consensus and consistent “second opinion” to clinicians. CADx(s) could even be of significant practical value in GYN oncology since FGT neoplasms exhibit salient intratumoral heterogeneity (ITH), a confounding trait driven in clinical diagnostics (Yin et al. 2019). As it stands, most studies supported the ability of EL-ML- and DL-classifiers to offer gyneco-pathologists refined interpretation for all OC, CxCa, and EC (BenTaieb et al. 2017; Sun et al. 2020; Downing et al. 2019; Yu et al. 2020a; Guo et al. 2016; Huang et al. 2020c; Li et al. 2019; Zhang et al. 2021a; Zeng et al. 2021; Cheng et al. 2020; Yang and Stamp 2021; Huang et al. 2020a; Meng et al. 2021; Xue et al. 2020a; Shin et al. 2021; Xue et al. 2019a; Zhang et al. 2021b). Counterintuitively, diagnoses are not restricted to cancer cases or healthy control, yet baseline report items wrap the histologic type, tumor grade, FIGO stage, tumor margin, lymph node status, and so forth. The foregoing exhibits subjective facets within GYN tumors, for instance, the inconsistency of differentiating endometrial adenocarcinoma from complex atypical hyperplasia within a biopsy, hysterectomy, or curettage material (Allison et al. 2008). Herein, Sun et al. (2020) proposed HIENet to classify endometrial tissue WSIs into “normal endometrium”, “polyp”, “hyperplasia”, and “adenocarcinoma”. The proposed strategy outperformed three human experts and five ML classifiers, which offers prospects for alleviating discrepancies noticed in endometrial tumor grading. Ergo, the adoption of ML techniques is projected to obviate workloads and shortages while permitting a more objective measure of clinical and surgical pathology. Subjectivity issue apart, invasiveness is debated in acquiring biopsy tissues.

Discovering non-invasive biomarkers has been a priority in cancer research for three decades, and ML-radiomics nomograms play a pivotal part in its pace (Jha et al. 2022)—H &E-based computational biomarkers are imminent and revolutionizing (Lancellotti et al. 2021). In a retrospective study of 435 women, Kawakami et al. (2019) confirmed the power of popular ML classifiers in epithelial ovarian cancer diagnosis using 32 preoperative blood biomarkers before any initial intervention. Song et al. (2018) explored 16 serum biomarker profiles in search of alternative marker pairings to lessen misdiagnosis. In another study, Aljakouch et al. (2019) revealed the potential of all Raman spectra, CARS, SHG, and TPF imaging as fast non-invasive CxCa diagnosis tools. Elias et al. (2017) suggested circulating miRNAs to establish a non-invasive OC diagnostic test, if it is validated, it could also screen women at high risk. Other research for precise diagnoses through non- or minimally invasive modalities has been undertaken, e.g. serum metabolomic fingerprints (Troisi et al. 2020), MRI (Urushibara et al. 2021; Nakagawa et al. 2019), US (Acharya et al. 2012), etc. Such synergy with DM might be a viable solution, notably, in (i) preoperative lymph node metastasis (LNM) diagnosis (Chen et al. 2021; Bedrikovetski et al. 2021; Wu et al. 2020) and (ii) worrisome-benign masses or cysts cases (Wasnik 2013). As per LNM, non-invasive and preoperative tools are compelled to discern between low-risk patients and those with substantial nodal involvement—a process generally supported by the presence of non-sentinel-lymph-node involvement in pelvic and para-aortic lymph nodes (Reijnen et al. 2020). Withal, non-invasive alternatives to biopsy referral eliminate adverse scenarios, e.g. patient discomfort, mental pressure, hemorrhage, soreness, etc (Bagaria et al. 2021). Recall that diagnoses govern therapy, postoperative prognostication, and management of cancer patients (Smeltzer et al. 2021), so properly performing ML-ML-diagnostic tools impacts the in-follow routines.

Early detection greatly affects survival rates and “avoidable deaths” (Wardle et al. 2015). It is supplied with two constituents: (i) screening and (ii) early diagnosis, or downstaging. Due to the dismal prognosis of GYN malignancies, screening (36%) is pivotal with forecast demand outweighing that of diagnosis. But while regular HPV-DNA testing, cervical cytopathology, direct visual inspection, and colposcopy, could preclude cervical cancer, screening for endometrial and ovarian cancers remains controversial. Till now no viable screening option is available for EC (Troisi et al. 2020). As per screening OC, transvaginal ultrasonography is often used in conjunction with the cancer antigen (CA)-125 biomarker (Henderson et al. 2018), yet neither has been found accurate in routine clinical use as increased preoperative serum CA-125 levels are diverse by other conditions beyond ovarian cell proliferation (Song et al. 2018). A reason behind “all except fifteen” screening-related articles aligning to cervical lesions, the remainder pertains to EC, OC, and leiomyoma. However, incorporating ML algorithms in cancer research has introduced a promising form of triage and screening, along with a unique opportunity to overcome umpteen limitations of either existing or potential screening tests. From a public health standpoint, it is imperative to leverage ML schemes for the early detection of GYN cancers. Here, Troisi et al. (2018) pilot study and related multicenter clinical validation (Troisi et al. 2020) evinced a non-invasive and low-cost endometrial cancer screening system based on metabolomics signature and ensemble classifiers. Mabwa et al. (2021) suggested the potential inclusion of dried biofluids mid-infrared spectroscopy as an EC screening tool based on ML classifiers. In another study, Barnabas et al. revealed the potential of microvesicle proteomic biomarkers in identifying all Stage-I ovarian lesions. Tsai et al. (2014) study provides the first evidence to associate ZNF509 and MTMR2 oncogenes to ovarian cancer through SVM. Elias et al. (2017) denoted that miRNAs recognized by neural networks are also found in the fallopian tube epithelium early lesions, suggesting the likelihood of pre-metastatic screening. This is significant in the sense that high-grade ovarian cancer is often detected at metastatic stages due to no early-stage-specific biomarkers and asymptomatism.

GYN cancerous cells often metastasize within the peritoneal cavity (Burg et al. 2020), with a range of events stemming from treatment regimens (Horvath et al. 2013) and disease advanced-stage. Hence, the selected articles coping with the treatment task (6.1%) frequently had connections to outcomes, complications, relapse, risk stratification, and/or decision-making of individualized treatments, noticeably: (i) predict the procedure outcome of uterine artery embolization for uterine fibroids as either symptom resolution/improvement or worsening symptoms (Luo et al. 2020), (ii) predict residual disease (R0) resection following cytoreductive surgery for advanced epithelial ovarian cancer (aEOC) (Laios et al. 2020), (iii) predict rectovaginal or vesicovaginal fistula formation when using interstitial high-dose-rate brachytherapy for locally advanced or recurrent GYN tumors unapt for intracavitary applicator treatments (Tian et al. 2019), (iv) comparing pre-treatment, post-treatment, and normal control cases by dint of OC DNA methylation data (Jiang and Liang 2020), (v) predict rectum dose-toxicity for CxCa combined external beam radiotherapy and brachytherapy (Zhen et al. 2017), (vi) predict local relapse and distant metastasis for locally advanced CxCa patients with definitive chemoradiotherapy-treated (Shen et al. 2019), (vii) predict post-treatment health related QoL in CxCa patients on the functional, symptom, and global health/QoL scales (Kumar et al. 2014), (viii) predict whether patients with locally advanced CxCa would be cured or relapsed after CRT (Torheim et al. 2014), (ix) predict CxCa recurrence by dint of DNA methylation data (Ma et al. 2021), (x) predict adverse events of radical hysterectomy in FIGO IA2–IIB CxCa patients (Kusy et al. 2013), (xi) predict platinum-free interval (PFI) of serous ovarian cancer patients by dint of histopathology and functional omics (Yu et al. 2020a), (xii) identify preoperatively EC patients at risk for LNM and poor outcome for personalized adjuvant treatment (Reijnen et al. 2020), (xiii) predict venous thromboembolism events in OC patients, e.g. venous thromboembolism (VTE) and deep vein thrombosis (DVT) (Fresard et al. 2020), and (xiv) identify high-risk of mortality OC patients by dint of genetic signatures (Hsiao et al. 2021).

Prognosis garnered 13.6% as ML-based prognostic prediction is a crucial paradigm in the realm of personalized medicine. It uncovers notably prognostic/predictive biomarkers (incl. candidate oncogenes and tumor suppressor genes), tumor recurrence, survival, and withal individualized treatment improvement (Laios et al. 2020). Monitoring received the least attention (2.3%); it requires therefore in-depth scrutiny as a vital facet of patient care entailing an innovative revolution. No management tasks-related articles were perceived.

4.6 Aligning DM analytics with GYN oncology

In theory, screening and diagnosis medical tasks are categorization matters encompassing several hurdles like FIGO staging, subtyping, TNM system, and so forth. It is therefore no wonder classification remains the subject of most research as far as screening and diagnosis are concerned in line with Sect. 3.7 data extraction. Regression (12.76%) comes second, accounting for prognosis-related use in the present selection study. Conventionally, traditional statistics analyses have been the mainstream of oncologic prognostic research, yet they are ineffective in coping with high-dimensional data and non-linear correlations (Rajula et al. 2020). Even though DM approaches are mostly based on probability and statistics, they are more resilient than parametric models allowing complex pattern identification across massive datasets with multi-input variables as accustomed to clinicians (Laios et al. 2020). Likewise, in silico ML-based regression approaches are gaining deeper insights - particularly in (i) asserting pathogenic (Cytosine-phosphate guanine) CpG sites and corresponding genes, where statistical approaches are unsuitable for DNA methylation (DNAm) data, as reported in Jiang and Liang (2020) study, and (ii) clinical outcomes such as disease-free survival (DFS), which is the utmost challenging endpoint in oncology, as reported in Wu et al. (2020) study. Clustering (11.06%) methods appear to be very useful in prognostic systems and gene expression assays. Further research of genetic clusters depending on miRNA expression patterns within training cohort samples could improve outcomes and proffer a better understanding of functional genomics (Lopez et al. 2018; Oyelade et al. 2016). Association analysis (2.98%) also permits the pinpoint of expressed methylated gene relationships in distinct samples and the shift from individual components to associative patterns amongst predisposing genes (Mallik et al. 2013; Lee et al. 2021). Hence, ARM-based algorithms distinctly improve tumor prediction and carcinogenesis, as Mallik et al. (2013) have proven for uterine leiomyoma prediction.

Most research adopted NNs, SVMs, and RFs. NNs are adequate to discern patterns, manage large or noisy data sets, and interestingly, learn in a manner reminiscent of the human brain (Yang and Wang 2020). Amongst the most implemented NN classes are CNNs. CNNs offer great prospects for addressing conventional medical CADx hurdles (Chan et al. 2020) due to their ability to (i) operate as both a feature extraction module and a classifier, (ii) to benefit from transfer learning to overcome poor generalization, overfitting, and the need for large databases supplements, more particularly for confirmed cases, and (iii) to cope with high-dimensional, heterogeneous, and complex input data as with biomedical imaging, particularly the arduous whole-slide imaging. For instance, Yang and Stamp (2021) preprocessed and patched uterine tissue biopsy WSIs from 250 potential LGESS patients for classification, in which DL outperformed classical ML models with the highest accuracy. Xue et al. (2019b) demonstrated the effectiveness of ResNet as a baseline classifier for the CIN grade classification problem using segmented and GAN-generated epithelium patches. Shin et al. (2021) a successful case of Inception V3 generalized through a style transfer technique in small image sets within the domain of ovarian digital pathology. Cheng et al. (2020) generated a structured report to support the interpretable analysis of cervical pathology images by using CNN-extracted features as input for an LSTM. Meng et al. (2021) proposed assembling a CNN-based classification, segmentation, and pseudo-labeling approach to enhance the performance on unlabeled cervical histopathology data. Zhang et al. (2021c) experiment used ResNet50 as the feature extractor coupled with several classical ML models to perform cervical tissue pathological images. Yu et al. (2020b) showcased the utility of CNNs in identifying tumor cells, and classifying both tumor grades and transcriptomic subtypes. Sun et al. (2020) developed a CAD based on a CNN and attention mechanisms to model the complicated relationships between endometrium histopathological images and related interpretations, where classical ML often fail to achieve satisfying findings. Huang et al. (2020b) investigated a novel cervical cancer classification method, analyzing the impact of fusing convolutional features from various depths and combining different classifiers on the accuracy of pathological image classification. Despite the success of DL in tackling several limitations of classical ML, most of the aforesaid raised a pressing need for causability to achieve a level of explainability and interpretability in DL pathology, aiming to develop a framework that can be seamlessly applied in clinical settings or within an inter-institutional heterogeneity as DL exhibits poor performance when used on disparate data from external sources. As performances of DL networks are proportional to the number of hidden layers—the greater the depth, the more robust a CNN Mostafa et al. (2022), going deeper would exacerbate deep neural network issues, which are akin to a ‘black box’ Mostafa et al. (2022). The inability to be interpreted yields NNs untrustworthy by the medical community. On the other hand, SVM is a potent approach used for classification and regression analyses that accommodates biological data due to their non-linear character such as genomics profiles (Huang et al. 2018; Sidey-Gibbons and Sidey-Gibbons 2019). It is recognized by the scientific community that SVM is an accurate classifier superior to neural classifiers Hirschberg et al. (2020). Yet, SVMs are greedy in terms of memory usage as well as computational time (Guo and Boukir 2015). A powerful scenario is to combine the SVM classifier with the bottleneck extracted features by a CNN as seen in Shao et al. (2020), Sun et al. (2020) studies or in the aforesaid instances. RF was used in 45 studies as it provides a robust classification with little overfitting and is simple to comprehend (Sarica et al. 2017). A further benefit of RFs is the identification of important features through the typical use of the Gini score, which allows biomarkers pinpointing as seen in Kawakami et al. (2019) study.

4.7 Assessing DM performance in GYN oncology

In literature, the preferred performance metrics are accuracy, sensitivity, specificity, and AUROC which, in turn, are in the present selection study. Whereas accuracy and precision disclose the test’s general reliability, both specificity and sensitivity highlight the likelihood of false positives (FPs) and false negatives (FNs) (Eusebi 2013). Pertaining to biomedical data, accuracy might not constitute a realistic measure since higher scores do not matter to evaluate the model as a proper medical task. Considering a simple triage of patients within imbalanced data, the model might enjoy a higher accuracy in scoring the dominant samples class, but sacrificed sensitivity in predicting the minority class (Tian et al. 2019; Wu and Zhou 2017). To get a profuse explanation, extending the evaluation by including other metrics is a must (Wu and Zhou 2017). Meanwhile, mean absolute error (MAE) and concordance index (c-index) were suitable in regression models by providing the ratio of the real and predicted value, which perfectly matches such type of task. Due to the easy but valuable handling of variations in oncological data, cross-validation was utilized largely by 62.43% of the selected articles as seen in Fig. 7. Next in line is the ’hold-out’ method, comprising 28.42% of the selected articles, as it involves the straightforward task of dividing a dataset into subsets. But while hold-out is less computationally costly, cross-validation schemes provide more meaningful insights regarding model performance on more complex and unseen data at every point, which circumvents any data bias concerns.

The ongoing shortage of human assessments underscores the inadequacy of research in DM as an adjunct to GYN oncology. More evidence, including comparative studies with clinicians, is needed, extending well beyond a mere emphasis on collaboration. This is crucial to substantiate the practical necessity in real clinical practice and to enhance DM role comprehension. For instance, Urushibara et al. (2021) demonstrated CNN potency in assisting radiologists in the interpretation of pelvic MRIs. The interobserver agreement between three radiologists (9–27 experience years) and CNN was observed to be lower than among the radiologists themselves. Such discrepancy suggests that the model might have employed a distinct perspective compared to humans, leading to differing judgments. Zhang et al. (2021d) demonstrated that the diagnostic performance of VGGNet-16 in classifying five endometrial lesions was on par with that of expert gynecologists, providing hysteroscopists with objective diagnostic evidence and potential clinical utility. Further, enhancing colposcopy precision is a crucial aspect of CIN management. Even among experienced colposcopists, the sensitivity for detecting CIN through colposcopy ranges from 81.4 to 95.7%, while specificity ranges from 34.2 to 69%. In comparison, Yuan et al. (2020) revealed that a DL diagnostic system exhibited superior performance when compared to colposcopists in the analysis of standard cervicograms for HSILs. However, its performance showed a slight decline when dealing with high-definition cervicograms. Bao et al. (2020) joined with AI-assisted cytology, achieving a level of performance in the detection of CIN2+ that was comparable to that of skilled cytologists in referring women. Integrating AI-assisted cytology into primary cytology screening, the most prevalent application in pathology, or HPV-positive triage, would improve specificity while not at the expense of sensitivity. As interobserver agreement remains a critical issue in GYN oncology, both studies have proved the potential of a stable and objective CADx system to lessen interobserver error and potentially mitigate the learning curve for less experienced gynecologists. The computerized models could also exhibit a more time-efficient performance as seen in Chen et al. (2020) study, where the DL network model, derived from MR imaging, delivered a competitive and time-efficient diagnostic performance in identifying MI depth. Exploring the DM-ML performance in therapy and its role in aiding clinical decision-making is needed. Further validation through human assessments might predict the malignancy risk of masses before surgical procedures (Gao et al. 2022) and enable better patient identification for those who can benefit the most from treatments, as seen in the promising results of Luo et al. (2020) study regarding the prediction of outcomes in uterine fibroid embolization.

Of note, it is worthwhile to highlight the efforts made by researchers to validate the implemented models.

5 Related work

In spite of the booming interest in Gynecologic AI, the topic has been marginally contemplated only in a handful of reviews. The related work primarily explores the topic of the convergence of AI and GYN oncology or categorizes it as a distinct subfield within software engineering.

Three astoundingly valuable systematic reviews scrutinized the application of AI algorithms in the context of female oncology. Such reviews encompassed varied directions, posing a challenge when attempting direct comparisons with the current review. Shrestha et al. (2022) consolidates the current state of the art in the field of gynecologic cancer imaging and provides insights for moving research and clinical integration of the analytics forward. The authors came with two pertinent challenges: generalizability and reproducibility while raising concerns about the paucity of clinicians vs. AI assessments and the relatively limited research conducted in the field of Gynecologic AI in comparison to Breast AI. Such discrepancy is particularly striking when comparing the volume of publications in the domains with the aggregated mortality rates of gynecologic cancers versus breast cancer. Still, only imaging was analyzed, which not only inhibits generalizability but pinpoints the need for further identifying other modalities. This was also apparent in Xue et al. (2022) study restricted even more to only cervical and breast cancers through DL which require further pathologies from the same anatomical site. Yet, the authors identified kindred observations in alignment with the current SLR in relation to (i) bolstering the current evidence through high-quality research such as prospective cohorts and clinical trials; (ii) generalizing prophase DL on larger and multi-institute data with a range of nationalities, ethnicities, socioeconomic status, and so forth; (iii) integrating DL into real-clinical settings throughout comprehensive healthcare systems while ensuring robustness and scientific validation for clinical and personal benefit; (iv) conducting a comparative analysis of DL against human clinicians, esp. sensitivity and specificity, as the existing study designs and reporting remain poor, which could lead to an overestimation of algorithmic performance; and (v) developing global and standardized reporting guidelines for DL in medical imaging to streamline the integration into clinical practice. Akazawa and Hashimoto (2021) revealed that a pioneer in CADx could involve utilizing Pap smears and digital colposcopy for the diagnosis of gynecologic lesions while also raising the abundance of single-institute retrospective studies concern and the high heterogeneity found. Still, the authors’ tentative assertion is based on only 71 studies within three databases and without quality appraisal for relatively poor designs and reporting, which underscores the necessity for more rigorous and comprehensive scrutiny of the available data.

Herein, it is imperative to embark on the available DM-ML quest and subsequently enhance its capabilities for heightened accuracy in potential clinical unmet ethically. Thereupon, continued reliance on systematic reviews is warranted to pinpoint research gaps, flaws, and limitations. Such extends not only to the prioritization of technology-specific reviews but also underscoring the significance of disease(site)-specific systematic reviews (Shrestha et al. 2022). While intelligent algorithmic remains in a perpetual state of evolution, our aspiration was to provide recommendations and catalyze further exploration within the topic. Such effort is intended to accelerate the adoption of the most potent technology into clinical practice at the earliest opportunity.

6 Implications for practice

Summarized evidence acknowledges the inclusion of ML modules in GYN oncology. Still, research has centered on sophisticated ML models rather than data peculiarities and unmet medical needs. An insufficient clinical trial validation could further activate fault lines, yielding attempts likely to be unsustainable systems in oncology settings. Arising from (yet not limited to) the findings of this SLR, pivotal considerations for proper DM-ML implementation in GYN oncology are:

  1. 1.

    Generalizability (akin to external validity). The ability of an ML model to execute satisfactorily on new, independent cohorts across disparate populations, protocols, and deployment environments. When the model fails to meet this definition, it has learned biases. Lingering concerns center on too small datasets, restricted to single institutional/regional sampling, exhibiting a population non-diversity, noise/artifacts, or variables in settings. Such datasets could be socio-demographics skewed, with a process chain inadvertently encoding biases, thus not representative of the broader population. Although recommended, generalizability is not always within reach due to ethics, budgeting, and many more considerations in data sharing. Hence, hospital collaborations using model-customization techniques are needed to translate site-specific trained models to new contexts, else generalizability remains seldom practiced in literature.

  2. 2.

    Prospective Cohorts. Data assessment of recruited subjects before outcomes, thus proving temporality, is a core factor in asserting causality. A prospective design is valued more highly in the evidence pyramid than a retrospective one (historical cohort) in which events have already occurred. It is a potent study design inferring the true benefit of ML schemes in real-world clinical settings and yielding, often, generalizable outcomes. Although encouraged, it is not always financially practical, as a prospective cohort demands eligible subjects to be followed up over a lengthy time frame. Bias or confounding ought to be prevented to the greatest extent, particularly regarding losses or individuals dropping out. Prospective cohorts remain seldom practiced in literature.

  3. 3.

    Randomized Controlled Trials. Prospective trials in which subjects are randomly assigned to either an intervention or control group. Such are often blinded and occur after the recruitment and eligibility phasis. The act of randomization is the best clinical evidence to assess an intervention’s effects while mitigating confounding factors. While it is not practical to formally assess each model iteration through a randomized controlled trial, it would clearly quantify the clinical value of an ML system that has already reached the clinical implementation phase. Further, it provides another level of evidence to encourage ML adoption in clinical practice, as a model utility and bias can only be ascertained by deciphering the mechanistic of predictions. Due to generalizability issues, a randomized controlled trial at baseline followed by a prospective external validity is advised despite each expenditure.

  4. 4.

    Trustworthy AI. The evolving and intricate term “trustworthy” constitutes, at its core, concerns surrounding AI-ML use, such as robustness, generalization, interpretability/explainability, fairness, transparency, accountability, responsibility, privacy, ethics, safety, reproducibility, intellectual property, so forth. Unfortunately, overly optimistic findings from intelligent systems often occur at the expense of degradation in their trustworthiness. Fostering research is required in, e.g. (i) representing global ethnic, socioeconomic, and demographic groups to mitigate equity challenges that might impact existing marginalized and underrepresented groups; (ii) reporting, data sharing, metrics, reproducibility, and traceability; and (iii) the algorithmic “black box” as a threat to core principles of procedural fairness. It is also imperative to establish international technology-specific and disease-specific guidelines for gynecologic AI.

  5. 5.

    Human-in-the-Loop AI. While it is imperative to acknowledge the value of AI in modern clinical practice, it is equally crucial to uphold the bedrock of patient-centric tenets and clinical experience in all AI implications. A decent use involves it as a complement to physicians, not as a substitute. AI, as ’a tool, not a crutch’, should only be integrated if it satisfies the following three criteria: (i) respecting the ethical conduct and credo of clinical practice, (ii) serving solid clinical ends, and (iii) enhancing human interaction. Unfortunately, many research endeavors fall short of satisfaction, with the ’human-in-the-loop’ dimension notably absent yet quintessential component.

7 Conclusion and perspectives

The current SLR examines the relevance of ML to GYN oncology. From the 2807 potential records obtained from PubMed, IEEE Xplore, ScienceDirect, Springer Link, and Google Scholar, 181 papers published over the past decade were assessed in depth. We found that the use of ML in GYN oncology soared markedly in 2019—most notably regarding cervical neoplasms; the bulk of articles were published in journals. Most research focused on cervical cancer as it is a paradigm of global health inequity, with higher morbimortality rates than both uterine and ovarian malignancies combined. Ovarian cancer comes second by cause of lethality, then endometrial cancer since it represents a significant global health disparity in developed countries. However, few articles leveraged machine learning to tackle other types of GYN cancers entailing more research. The majority were solution proposals, all of which were empirically evaluated. However, a paucity of validation research is noticed. Most of the selected articles were empirically evaluated, with more than half referring to case studies but only a few are based on surveys. Clinical records gathered great interest, followed by Pap smears, cervicograms, liquid-based cytology, then ’biopsy, D &C, or hysterectomy’ WSI. An evolving use despite the challenges encountered in medical data. The most investigated task was the diagnosis. Minimal invasive sampling has been shown through ML promising results, however, the potential value of screening among the asymptomatic must also be evaluated. Most of the selected articles investigated the classification task. Neural networks (NNs) were prevalent and devoted to respective classification and regression. Articles employed a variety of performance metrics, with accuracy, sensitivity, specificity, and AUROC being the most popular combination. Cross-validation was used extensively in all validation schemes.

We must acknowledge the present SLR limitations. First, although we rigorously broadened the databases and search terms, it was still probable that some research was omitted. Second, as the extent is limited to English material published within the past decade, we might have missed prior research to this time (or conducted in a different language). Third, the undertaken quality assessment could have screened out valuable articles providing differing findings. Fourth, we were unable to conduct a quantitative analysis of the evidence as it fell outside this SLR scope due to the heterogeneity in our selection study. We also could not thoroughly analyze the bias risks or integrity of the reported articles due to related gaps.

Future perspectives include: (i) Signs of ovarian and endometrial cancer are hazy and commonly present in benign conditions, resulting in usually being diagnosed in late stages when survival rates are poor. Due to the rapid growth of DM-ML in cancer prediction, non-invasive and low-cost screening systems from several biomarkers sources have garnered a research interest as a promising form of triage. A critical survey of DM utility in early-stage OC/EC among the asymptomatic may provide impetus to research, providing also opportunities for drug discovery. (ii) Deep learning has undeniably exerted a substantial influence across various scientific domains in recent years. This has been evidenced in practices where approaches surpassed conventional methods, yielded commendable outcomes in previously unattainable tasks through automation, and even excelled in applications surpassing human capabilities. Now, GYN oncology is where DL progress has started to exhibit immense potential. While the promise is evident, concerns arise about the practical usability of the research output, often confined to ideal scenarios as real-life examples have recently demonstrated fallibility, raising the common question: ‘How can we place trust in neural networks to avoid errors and biases when we lack understanding of their decision-making processes?’. Most of the evidence in the current SLR concentrated on a global technicality. It would be interesting to delve in-depth into the actual evidence and what the future holds for the field of GYN deep learning. (iii) Assessing the performance of ML/DL models is imperative to evaluate the predictive ability across various populations and settings and to determine further improvements through limitations relevant to reported studies. As a future scope, we aim to systematically review, with meta-analysis, the predictive performance of the models used in this SLR study. Such will be based on the existing evidence on a particular medical task, prediction task, and GYN cancer. Based on a similar strategy adopted by Wadghiri et al. (2022), we aim to provide guidance on which performance metrics could be extracted from the selection study considering each evaluation and optimal configuration, associated statistical measures, why they are important, and how to handle circumstances when missing or poorly reported.