Introduction

In the 1950s, J. McCarthy at Stanford University and A. Turing at Cambridge University proposed the concept of machine simulation of human learning and intelligence [1, 2]. As keen mathematicians, they advanced basic mathematical logic into programming languages, enabling machines to perform more complex functions. E. Shortliffe built on these systems to develop MYCIN, which successfully simulated the reasoning of a human microbiologist in diagnosing and treating patients with microbial infection [3]. This model introduced Expert Systems (ES) to the scientific literature, and a ten-year review by Liao et al. demonstrated their wide prevalence across industrial fields, with numerous applications including health care [4]. In contrast to Liao's review, other studies questioned their real-time implementation in health care and suggested a lack of uptake and integration into health care systems [5]. This is despite evidence from systematic reviews demonstrating the positive impact of computer-aided systems on patient outcomes and health care [6, 7].

This study aimed to systematically review published ES in urological health care, with the primary aim of demonstrating their availability, progression, testing and applications. The secondary aim was to evaluate their development life cycle against standards suggested by O'Keefe and Benbasat in their review articles on ES development [8, 9]. The latter would evaluate the gap between their development and implementation in health care.

Methods

The study methodology followed the recommendations outlined in the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement (Fig. 1). No ethical approval was required, as studies of this type are exempt from this requirement.

Fig. 1

PRISMA flow chart for the systematic review of articles included in the review of expert systems in urology

Search

Information sources including Web of Science, EMBASE, BIOSIS Citation Index, Scopus, PubMed, Google Scholar and MEDLINE were searched using the keywords listed in Table 1. Articles published between 1960 and 2016 were considered and examined against the inclusion criteria. While the search was tailored to each literature database, keywords were combined with "OR" within each domain, and the domains were then combined with "AND", as illustrated in the sketch below.

Table 1 Keywords used for literature search
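For illustration only, the following sketch shows how such a boolean query string can be assembled programmatically. The keyword lists are hypothetical placeholders; the actual search terms are those listed in Table 1.

```python
# Illustrative sketch only: keywords are joined with "OR" within each domain
# and the domains are then joined with "AND". The keyword lists here are
# placeholders, not the actual terms from Table 1.
def build_query(domains: dict[str, list[str]]) -> str:
    grouped = ["(" + " OR ".join(f'"{kw}"' for kw in kws) + ")"
               for kws in domains.values()]
    return " AND ".join(grouped)

domains = {
    "expert system models": ["expert system", "artificial neural network",
                             "decision support system"],
    "urological domains": ["urology", "prostate cancer", "bladder cancer"],
}

print(build_query(domains))
# ("expert system" OR "artificial neural network" OR "decision support system")
#   AND ("urology" OR "prostate cancer" OR "bladder cancer")
```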

Eligibility criteria

For the primary aim, search results were analysed according to pre-planned eligibility criteria based on the system model, year of production, type and outcome of validation, functional domain application, input and output variables, target user and domain. These selection criteria were designed to identify expert system studies and demonstrate their prevalence, testing, and applications in clinical urology. Only articles and studies written in English were included.

A further qualitative analysis was required to meet the study's secondary aim. For this, additional data were gathered on credibility (user perception of the system), evaluation (system usability), validation (building the right system) and verification (building the system right), and then compared against the standards reported in [8, 9].

Data filtering

The reference list of each included article was checked to identify potentially eligible items that had not been retrieved by the initial search. All retrieved articles were collated into a final reference list in reference management software (EndNote X8), and duplicate studies were removed.

Once more than one hundred articles had been included, the remaining eligible articles were compared with those already included and excluded if they demonstrated clear similarity. This was done to avoid expanding the dataset without adding to the study analysis.

Results

The artificial neural network (ANN) was the commonest model applied in urological ES (Fig. 2). The remaining models were diverse, which is consistent with other published industrial systems [4].

Fig. 2

Analysis of Expert Systems (ES) by model (n = 169). ANN was the most common, but other systems were applied in different domains, such as fuzzy neural models (FNM), rule-based systems (RBS), fuzzy rule-based systems (FRB), support vector machines (SVM), Bayesian networks (BN) and decision trees (DT)

Prostate cancer was the commonest domain for urological ES, with most of the systems focusing on cancer diagnosis. These systems were applied to various domains (Fig. 3) and were further stratified and analysed according to their core functional application, as outlined in the methodology.

Fig. 3

Urological domains (n = 168) addressed by Expert Systems (ES). Prostate cancer (CaP) was the commonest domain, followed by bladder cancer (BCa) and other diseases such as benign prostatic disease (BPD), pelvi-ureteric junction obstruction (PUJ), urinary tract infection (UTI), renal cell cancer (RCC) and vesico-ureteric reflux (VU reflux)

Quantitative analysis

Decision support systems

The main objective of ES in this domain was to facilitate clinical decision making by identifying key elements from patients' clinical and laboratory examinations and then refining a theoretical diagnostic or treatment strategy [10]. They can guide the expert to the right answer [11], take over decision making to support the non-expert as in [12], or even replace both and interact with the patient directly [13].

They have supported various aspects of urological decision making, such as diagnosis, investigation analysis, radiotherapy dose calculation, and the delivery of behavioural treatment and therapeutic dialogues.

Domains

Urinary dysfunction (U Dys) was the commonest domain covered by the decision support application (n = 9), and these systems could be further categorised into U Dys diagnostic, investigation analysis and therapeutic systems. They demonstrated a range of methodologies, validation approaches and target users (Table 2). For instance, Keles et al. [14] designed an ES to support junior nurses in diagnosing urinary elimination dysfunction in a selected group of patients, while the systems of [15, 16] were able to support any medical user in diagnosing urinary incontinence with an accuracy exceeding 90%. The target users of most of these systems were predominantly medical health care workers, including both experts and non-experts, with the exception of [13, 17], which can be used directly by patients to receive an assessment of their urinary elimination dysfunction followed by a tailored treatment plan.

Table 2 Decision support systems in urological domain

Prostate diseases were represented in 6 systems, 3 of which were modelled by [10, 12, 20] for diagnosing both benign and malignant prostatic disease, namely prostate cancer (CaP).

All systems in this domain were diagnosis support systems, with the exception of [19], which also provided treatment for benign prostatic hyperplasia (BPH), and [11], which calculated the required radiotherapy dose for treating CaP.

Sexual dysfunction was modelled in 3 systems: [21] diagnosed male sexual dysfunction with an accuracy of 89%, while [22] added a therapeutic model for the same disease with an overall accuracy of 79%. Sexpert [23] was the third system in this category; developed in 1988, it was in fact the oldest ES identified by our search across all urological domains. Interestingly, this rule-based system was designed to interact directly with couples suffering from sexual dysfunction, responding to their queries with a tailored therapeutic dialogue for treating their problem.

Urinary tract infection (UTI) was diagnosed and treated by a hybrid fuzzy system (FNM) developed by [24] with an accuracy of 86.8%.

Diagnosis prediction

In this domain, ES quantify the probability of a clinical diagnosis with a defined margin of error. They simulate a second expert opinion, and it has been suggested that their use could eliminate unnecessary invasive investigations; for example, the ANN application of [26] could reduce repeated TRUS biopsies for diagnosing CaP by up to 68%.

Domains

Prostate cancer was the main domain for this application, with 19 systems out of 20. Most of them were designed to predict organ confinement before radical surgical excision of the prostate (Tables 3, 4). The target population was patients with clinically localised CaP, and accuracy reached high levels, as in [28], where the system was able to predict 98% of the group at low risk for lymph node involvement using preoperatively available data (PSA, clinical stage and Gleason score).

Table 3 Diagnosis prediction application of Expert Systems (ES) in Urology
Table 4 Disease stage prediction

Chiu et al. [29] modelled a system with clinical variables for patients undergoing nuclear bone scintigraphy to predict skeletal metastasis. The system was able to predict metastatic disease in the test group with a sensitivity (Se) of 87.5% and specificity (Sp) of 83.3%.

Non-seminomatous testicular cancer was the other domain in this application, with system [27] able to predict the disease stage (Table 4) with an accuracy reaching 87%.

Treatment outcome prediction

In this application, ES combined disease- and patient-related factors to estimate the success of a specific treatment or intervention, as in [30, 38, 64, 69], where the systems predicted the outcome of extracorporeal shock wave lithotripsy (ESWL) for treating kidney stones, and in [74, 75], which provided an estimate of cancer recurrence after radical surgical treatment of prostate cancer.

Domains

Prostate cancer was also a common domain in this application (n = 23). Potter [74, 75] described 4 models developed from data acquired from patients with clinically localised CaP who had undergone radical prostatectomy with curative intent. The variables included clinical findings and the histological findings of the surgical specimen, and the models were able to predict up to 81% of patients who did not show evidence of biochemical failure (rising PSA) during follow-up. The models of Hamid et al. [76] and Gomha [77] were not restricted to the clinically localised CaP cohort; their study populations included patients at different disease stages and on any treatment pathway. Their models included 2 experimental histological markers (the tumour suppressor gene p53 and the proto-oncogene bcl-2) among their input variables, and the estimated predictive accuracy for patient response to treatment reached 68% and 80% (p < 0.00001), respectively.

Nephrolithiasis treatment was addressed by 6 other systems applying the treatment outcome prediction concept. Cummings et al. targeted this group with their ANN [78], training the network with data from patients with ureteric stones treated at the emergency services of 3 centres, to identify patients failing conservative management and requiring further intervention. When tested on a separate set of 55 cases, the system correctly predicted 100% of the patients who passed the stone spontaneously, with an overall accuracy of 76%.

Extracorporeal shockwave lithotripsy (ESWL) is one of the favoured interventions in the nephrolithiasis treatment domain. The stone receives strong external shock waves, which reduce it to small fragments and eliminate the need for direct instrumentation of the renal tract. Reported success rates can only provide a generalised prediction of outcome for the individual case, and ANN provided an alternative multivariate analytical tool in the 4 models developed by [30, 38, 64, 69]. These models reported high accuracy (Table 5); in [64], for instance, the system predicted 97% of the patients who were confirmed to be stone free following ESWL for ureteric stones.

Table 5 Treatment outcome prediction

Paediatric pelvi-ureteric junction obstruction is primarily treated conservatively unless there is evidence of renal function compromise, recurrent infection or worsening radiological findings. For the group failing conservative management, pyeloplasty is the second line of treatment, and [81] developed an ANN to estimate the success rate of this procedure for each individual case by predicting the post-operative degree of hydronephrosis, with a reported 100% accuracy in the small tested sample.

Vesico-ureteric reflux, or reflux uropathy, is another paediatric disease, characterised by back flow of urine from the bladder into the ureter through an incompetent vesico-ureteric valve. Treatment is primarily conservative, as it can be a self-limiting disease, or consists of surgical reimplantation of the ureters or endoscopic injection of a bulking agent at the ureteric orifices [80]. The study authors trained a neural network using 261 cases who had received endoscopic injection, and the system predicted 94% of the patients who did not benefit from the treatment [80].

Laparoscopic partial and radical nephrectomy were the domain of [82], which was developed using multi-institutional case data (age, co-morbidities, tumour size, and extension) from patients undergoing laparoscopic partial or radical nephrectomy. The system was able to predict the length of postoperative hospital stay with an accuracy of 72%.

Bladder cancer can be treated with complete bladder excision, and [79] developed systems to predict the cure rate with an accuracy of 83%.

Recurrence and survival prediction

ES in this domain aimed to provide individualised risk analysis tools, estimating disease-specific mortality and recognising the group who may benefit from more aggressive or adjuvant treatment.

Domains

Bladder cancer survival and recurrence prediction following radical cystectomy (RC) with curative intent was the commonest domain in this application (24 out of 26 systems). Lymph node involvement is highly predictive of recurrence, and these patients are considered for adjuvant or neoadjuvant systemic chemotherapy. The node-free cohort will nonetheless include high-risk patients who are not identified by conventional linear stratification systems. Catto et al. developed an FNM system to identify this high-risk group in the node-free cohort by predicting the disease recurrence rate (Se 81%, Sp 85%) and survival, with a median error of 8.15 months [92]. The high-risk group identified by this model can benefit from systemic treatment post-cystectomy to improve disease-related morbidity and mortality [95, 96]. Five-year survival post-cystectomy was the output of 2 other ANNs, with high prediction efficacies of 77% and 90%, respectively (Table 6) [97, 99].

Table 6 Recurrence and progression prediction

Renal cell cancer is primarily treated with partial or radical nephrectomy for clinically localised disease, with systemic therapy for metastatic disease. There is still a degree of uncertainty in stratifying individual disease risk in order to predict the indication for and outcome of systemic therapy in the group with distant metastases. Vukicevic et al. [98] attempted to address this uncertainty by training a neural network with data from patients who had undergone nephrectomy (partial or radical) and received systemic therapy. The mature model predicted the patients who survived the disease at 3 years with an overall accuracy of 95% (CI 0.878–0.987).

Five-year recurrence of non-seminomatous testicular cancer was the domain of the ANN in [118]. The system was trained with multicentre data and, in its testing phase, predicted 100% of the patients who did not suffer disease recurrence at 5 years, with an overall predictive accuracy of 94% (AUC = 87%).

Predicting research variables

In academia, testing a hypothesis for a 'factors–outcome effect' is a popular pursuit, and standard statistical regression tools may not be effective for data contaminated by irrelevant variables [119]. AI can provide an alternative analytical methodology to identify variables highly correlated with the outcome by applying machine learning, as in ANN. The area under the curve (AUC) is first estimated for the system's predictive accuracy using all researched variables. Each research variable is then given random values or permuted, and the AUC is re-estimated for comparison with the original [120]. Only variables whose randomisation decreases the AUC are considered significant, and the wider the discrepancy in AUC, the more significant they are (Table 7).
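As a concrete illustration of this procedure, the sketch below estimates a baseline AUC and then permutes one candidate variable at a time, treating the fall in AUC as the measure of that variable's significance. The data, model choice (a small scikit-learn neural network) and variable indices are illustrative assumptions, not those of the cited studies.

```python
# Minimal sketch of AUC-based variable significance via permutation, under
# assumed synthetic data and a generic scikit-learn classifier.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                          # 4 candidate research variables
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
model.fit(X_train, y_train)

baseline_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

for j in range(X.shape[1]):
    X_perm = X_test.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])       # randomise one variable at a time
    perm_auc = roc_auc_score(y_test, model.predict_proba(X_perm)[:, 1])
    # A large drop in AUC marks the variable as significant for the outcome.
    print(f"variable {j}: AUC drop = {baseline_auc - perm_auc:.3f}")
```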

Table 7 Research variable prediction
Domains

Prostate cancer was a common domain in this application, with a total of 15 systems analysing predictive factors for cancer diagnosis, response to treatment and quality of life with prostatic disease. One of the hot topics in urological cancer is the discovery of alternative CaP diagnostic markers, since serum PSA is not sensitive for distinguishing benign from malignant disease. Stephan et al. investigated the diagnostic value of three markers in this domain: macrophage inhibitory cytokine-1, macrophage inhibitory factor and human kallikrein 11 [108]. These were used as variables (nodes) in ANN models whose accuracy was compared to linear regression of %fPSA. They reported that only the ANN model including all three variables was more accurate (AUC 91%, Se 90%, Sp 80%) than all other models, supporting their hypothesis that the markers are only relevant when combined.

Similarly, another study estimated the predictive value of serum PSA precursors (-5, -7 proPSA) in diagnosing prostate cancer, comparing the accuracy to %fPSA [107]. The -5 and -7 proPSA were only significant in the cohort with PSA between 4 and 10 µg/l and did not improve the predictive accuracy when added to %fPSA. The same author tested this hypothesis on another free PSA precursor (-2 proPSA) by developing an ANN with %p2PSA (the -2 proPSA to fPSA ratio) among other disease variables, which improved the system accuracy (AUC 85% from 75%) [120].

Three systems evaluated the presence of bcl-2 and p53 as predictive variables for response to prostate cancer treatment [76, 77]. Their combination was reported to be significant (accuracy 85%, p < 0.00001) in [77], but [76] found that only bcl-2 was relevant in the other two models (accuracy 63–68%).

Bladder cancer diagnosis and disease progression was the second most common domain, with 13 systems. Kolasa et al. [110] modelled an ANN with three novel urine markers (nuclear matrix protein-22, monocyte chemoattractant protein-1 and urinary intercellular adhesion molecule-1) to predict the diagnosis of bladder cancer, and it succeeded in predicting all cancer-free patients when the three variables were used as a group. Catto et al. [119] developed two AI models (ANN and FNM) performing microarray analysis on genes associated with bladder cancer progression. Their models narrowed these genes down from 200 to 11 progression-associated genes (OR 0.70; 95% CI 0.56–0.87), which were found to be more accurate than regression analysis when compared against the specimen immunohistology results.

The Kolasa et al. [110] model predicted the pre-histology diagnosis of malignancy based on urine levels of novel tumour markers. Their ANN was found to be more accurate (Se 100%, Sp 75.7%) than haematuria diagnosed on urine dipstick (Se 92.6%, Sp 51.8%) or atypical urine cytology (Se 66.7%, Sp 81%).

ESWL of renal stones was the research domain of [30, 69], which aimed to identify significant variables correlated with the treatment outcome (stone free) and to develop a predictive model. The Chiu et al. [69] model did not recognise residual fragments following ESWL as a significant risk for triggering further stone growth, while [30] identified BMI, an infundibular width (IW) of 5 mm and an infundibulopelvic angle (IUPA) of 45° or more as all predictive of lower pole stone fragmentation and clearance.

Benign prostatic hyperplasia was modelled in one system [114] to link disease-specific clinical and radiological factors with disease progression in patients with mild disease (IPSS < 7) who were not receiving any treatment. The ANN identified obstructive symptoms (Oss), a PSA of more than 1.5 ng/ml and a transitional zone volume of more than 25 cm3 as correlated with disease progression, and it could accurately predict 78% of the cohort who would need further treatment.

The accuracy of diagnosing urinary dysfunction from clinical symptoms was compared with urodynamic findings in female patients with pelvic organ prolapse by [115]. Neither the linear regression nor the ANN model could establish a relationship between symptoms and the urodynamics-based diagnosis, dismissing the hypothesis that clinical symptoms alone can provide an accurate diagnosis and replace the need for a urodynamics study.

Hypogonadism (Hgon) was represented in the system of [133], where the diagnosis was made based on the patient's age, erectile dysfunction and depression, with an AUC of 70% (p < 0.01).

Image analysis

This is one of the advancing applications of AI in medicine, where the system either analyses variables from reported medical images as data input or identifies these variables through a separate image analyser without the need for an expert to report the scan or images. The first category was included among the systems mentioned above, for example in the diagnosis prediction domain, where [47] included different variables from TRUS in the system input to predict a CaP diagnosis. In this domain, we focused on the other group, where images are presented to the machine as raw data translated by the image analyser and the system then applies machine learning to identify the cause–effect pattern (Table 8).

Table 8 Image analysis
Domains

Prostate cancer image analysis was modelled in 10 systems, to enhance diagnostic accuracy as in [126] and to predict disease progression as in [128]. The first system represented each TRUS image pixel as one variable, or neuron, in a pulse-coupled neural network (PCNN) and was trained with 212 prostate cancer images to segment the prostate gland boundary, with an average overlap accuracy (the overlap measure being the difference between the PCNN boundary and the expert's) of 81% on ten images [126].

Another 4 systems analysed histological images from a cohort of patients with clinically localised CaP after radical prostatectomy (RP) to predict disease progression. The histological images were colour coded and analysed by the system, which used variables such as the percentage of epithelial cells and glandular lumina to identify the group at high risk of disease recurrence, with an accuracy reaching 90% [128].

Urine cytology images from lower urinary tract (LUT) disease were analysed by 2 models in [123], which identified all patients with benign disease with an overall accuracy of 97%.

Nephrolithiasis stone biochemistry analysis can be achieved through expert interpretation of infrared spectroscopy, which was simulated by [124]: the infrared spectral wavenumbers were modelled as input variables, and the system's prediction of the expert-analysed stone composition had a root mean square error of 3.471.

Qualitative analysis

The same articles were considered for the qualitative analysis against the four stages (validation, verification, evaluation and credibility) reported in O'Keefe's industrial survey [8] and Benbasat's article [9]. None of the included systems demonstrated completion of all four stages examined in this qualitative analysis. It is possible that some of these missing stages were performed but not published in the scientific literature.

Validation was performed by almost all the systems (166 out of 169), with varying degrees of study strength, bias, and limitations (Table 9). Most of the data-driven systems (ANN, SVM, BN, kNN and FNM) were validated by ROC analysis and the AUC, using a training and validation set, cross-validation, or the leave-one-out technique. Samli et al. enhanced the validity of their system by estimating the kappa statistic alongside the ROC [134].
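For reference, the sketch below shows the three validation patterns mentioned (a hold-out training/validation split, k-fold cross-validation and leave-one-out), applied to a generic scikit-learn classifier on synthetic data; it is not drawn from any of the reviewed systems.

```python
# Minimal sketch of ROC/AUC validation by hold-out split, cross-validation and
# leave-one-out, under assumed synthetic data and a generic classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut, cross_val_score, train_test_split

X, y = make_classification(n_samples=200, n_features=6, random_state=0)
model = LogisticRegression(max_iter=1000)

# 1. Simple training/validation split.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
model.fit(X_tr, y_tr)
print("hold-out AUC:", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))

# 2. k-fold cross-validated AUC.
print("5-fold AUC:", cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())

# 3. Leave-one-out: one prediction per left-out case, AUC over all of them.
loo_probs = []
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    loo_probs.append(clf.predict_proba(X[test_idx])[0, 1])
print("leave-one-out AUC:", roc_auc_score(y, loo_probs))
```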

Table 9 Qualitative assessment of urological Expert Systems

Evaluation was performed by only a small fraction of these systems (n = 6). The evaluation targeted either the user or the expert, but rarely both. There is no evidence that these evaluations were performed at an early stage to determine the system's value to the user.

System credibility and verification were never reported. It may be implied that verification was performed to some extent but not reported, as it is a technical part of development.

'System development limitation and bias evaluation' demonstrated an overall acceptable validation methodology with valid statistical analysis. However, a few limitations were observed (Table 9), the most common being the use of human opinion as the gold standard (n = 9). For instance, the gold standard for diagnosing prostate cancer is tissue biopsy confirmation; treating the expert clinical diagnosis as the gold standard reference can lead to statistical errors and invalidate the study.

Discussion

Expert Systems are widely available in urological domains, with a large range of models, applications, domains, and target users, including patients, students, non-experts, experts, and researchers. The number of published systems has risen over the years, but with a consistent lack of publications reporting their real-time testing or healthcare implementation (Fig. 4).

Fig. 4

Expert System (ES) analysis by year of publication, showing an upward trend in the number of publications. Systems were included according to the keywords for expert system models applied in urological domains

There is increasing interest in analysing this gap, which is reflected in the scope of historical AI review articles that aimed only to familiarise readers with the existence and application of ES [33, 125]. In fact, the majority had a relatively narrow focus on the evolution and application of one ES model (the artificial neural network) in prostate cancer diagnosis. More recently, and similar to our research, there has been growing interest in AI validation and the lack of uptake despite faith in its ability. Therefore, in this study we quantified ES progression and applications in urology while examining their development life cycle.

It was evident that CaP was the commonest domain in almost all applications, contributing more than two thirds of the systems (91 systems in total). Different aspects of this domain have been simulated by these systems, including diagnosis, therapeutics, prediction of disease progression or treatment outcome, research variables and medical image analysis. Most of these systems simulated the urologist's cognitive function, with little guidance on their benefits or on how they could be implemented to improve cancer decision making.

In industry, this is usually performed before system development by evaluating system usability from the user's perspective. This step has been lacking, or at least not acknowledged, in the published studies and is possibly a core reason for the lack of integration in urological health care. Furthermore, none of these systems has been subjected to live testing in a well-designed study to prove its efficacy over standard tools, or in the clinical context to prove its validity and justify its complex structure to AI-novice health care professionals. The qualitative analysis demonstrated that validation is the only stage of the development cycle applied by most of the systems, and there is a lack of system evaluation, credibility, and verification. Evaluation can be subdivided into usability (usually assessed by the average user), utility and system quality (assessed by experts) [9]. Despite being a crucial stage of ES development, the published articles have paid little attention to integrating it into the development life cycle. This can mean the whole system fails and its uptake is challenged [8].

An example can be drawn from this review, where the majority of systems focused on CaP diagnosis and treatment. Their implementation would be challenged by the standard decision-making tools of the cancer multidisciplinary team and by the ethical concerns of relying on an ANN to make such life-changing and expensive decisions. A utility analysis of those ES would have been essential for tailoring their development to real-time applications where they could be of more value to the user. One example is the lack of community-based systems for the initial referral of suspected cancer patients and the follow-up of stable disease, where NICE has identified a need for such decision support models [152, 153].

There was wide diversity in the modelling of urological ES, with ANN being the most common model in this review. ANNs bypass the need for direct learning from experts and the exhaustive process of knowledge acquisition, which is a core requirement for knowledge-based systems and can arrest the whole system's progress [55]. However, their analytical hidden layers of nodes, the "black box phenomenon", have been subject to wide criticism and rejection from clinicians due to the lack of transparency and understanding of their function.

Stephan et al. suggested a statistical solution to identify variable significance by performing a sensitivity analysis [154]. This estimates the variation in the AUC when each variable is introduced or eliminated. It can only reflect the significance of each variable; it does not explain how cases are being solved, nor quantify this for the user as a standard statistical value. It can nevertheless be useful in research, as it can identify significant variables in a large dataset and has been successfully applied in academic urology, as in [119], where the system identified the relevant gene signature for bladder cancer progression, saving the time and cost of microarray analysis of all suspected genes.
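A minimal sketch of this kind of sensitivity analysis is shown below: the model is retrained with each variable eliminated in turn and the resulting change in AUC is recorded. The dataset and classifier are generic stand-ins and are not taken from Stephan et al.

```python
# Minimal sketch of variable sensitivity analysis by elimination: retrain with
# each variable removed and record the change in AUC. Assumed synthetic data.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

def auc_for(cols):
    """Train on the given column subset and return the test AUC."""
    clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=1)
    clf.fit(X_tr[:, cols], y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te[:, cols])[:, 1])

full_auc = auc_for(list(range(X.shape[1])))
for j in range(X.shape[1]):
    reduced = [c for c in range(X.shape[1]) if c != j]
    # A large fall in AUC when variable j is eliminated marks it as significant.
    print(f"variable {j}: AUC change = {full_auc - auc_for(reduced):.3f}")
```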

Holzinger et al. emphasised the importance of the explicability of AI models, especially in medicine, which is a clear challenge for machine learning due to its complex reasoning [155]. Their study attempted to simplify explanation by classifying systems into post-hoc and ante-hoc approaches. In post-hoc approaches, explanations are provided for a specific decision, as in model-agnostic frameworks where the black box reasoning can be explained through transparent approximations of the mathematical models and variables [156, 157]. These are produced on demand for a specific problem rather than for the whole system, which can shed more light on the system's function. It is not certain whether these can be easily interpreted by the AI-novice clinician, but they have provided more explicit models for tackling the black box phenomenon.

Knowledge-based systems can be explained by ante-hoc models, where the whole system's reasoning can be represented. These systems rely on expert knowledge for their development and face the bottleneck phenomenon in their application. Furthermore, they are not always successful in identifying and mapping multilinear mathematical rules, for which machine learning is mandatory, or at least more efficient [155]. Bologna and Hayashi suggested that machine learning is more successful in complex problem solving, with an inverse relationship between machine performance and its built-in transparency [158].

Another aspect commonly lacking in these articles was the coupling of the system development methodology with medical device registration requirements. This is essential, as ES often function as standalone software with no human supervision of their calculations. This categorises the system as a medical device with a mandatory prerequisite to register with the relevant authorities, such as the Medicines & Healthcare products Regulatory Agency in the UK [5].

Cabitza et al. compared AI validation to that of other medical interventions such as drugs and emphasised treating the "software as a medical device" [159]. Unlike other devices or drugs, AI models in healthcare are unique in being more dynamic, which should be reflected in their validation cycle. They also invoked the established term "techno-vigilance" to learn from other medical device validation pathways. They recommended a different outlook on validation, in which it is broken down into statistical (efficacy), relational (usability), pragmatic (effectiveness) and ecological (cost-effectiveness) components, with standards available for these steps (ISO 5725, ISO 9241 and ISO 14155). The last is viewed as a novel standard for evaluating the cost benefit of applying a specific AI model in healthcare, which would require longitudinal modelling of health economics [159]. This was evidently lacking in the articles included in our review; in fact, most of the studies were non-randomised and retrospective.

Similarly, Nagendran et al. systematically analysed studies comparing AI performance with experts in classifying medical imaging as diseased or non-diseased and concluded that AI performance was non-inferior to human experts, with the potential to outperform them [160]. Their 10-year review identified 2 randomised clinical trials and 9 prospective non-randomised trials from the literature, extracted from totals of 10 and 81 studies, respectively. Their review assessed the risk of bias using the PROBAST (prediction model risk of bias assessment tool) criteria for non-randomised studies. The tool is designed to identify the risk of bias by analysing four domains (participants, predictors, outcome, and analysis) [161] and is applicable to systematic reviews analysing prediction models with a target outcome.

In our study, as there was no unified outcome for the included prediction tools, the focus was on the role of validation rather than on the outcome. Therefore, these risk-of-bias assessment tools were not utilised, due to the wide gaps in the tool checklist between the included articles. Such heterogeneity of study design and data was also evident in Nagendran et al., and, as in our study, data synthesis was not possible. This will pose a challenge to reinforcing the application of AI models in healthcare, due to the lack of level 1 evidence, which is mandatory in healthcare for accepting a novel intervention.

Finally, the quality of the data analysis was beyond the scope of our systematic review, despite being essential for developing quality AI systems. Cabitza et al. examined this gap and focused on data governance [161]. There has been very limited evidence on data quality appraisal and standards, with calls for further research and the allocation of more resources, especially in healthcare, where the data are notoriously limited by errors or discordance.

The potential application of AI in urology, with a focus on its future use, has recently been discussed by Eminaga et al. [162]. They showed an increasing interest in urological research, but with uptake challenged by model complexity and a lack of end-user understanding of design and function. Furthermore, they identified a discrepancy between AI engineering and clinical application, reflecting a degree of miscommunication between the two disciplines.

This can be either a consequence of or a cause for the lack of clinical utility testing, which increases the need for research in this domain to be incorporated into software development [163]. In fact, it has been recommended to perform the utility test before developing the system, to tailor its application [164, 165]. Despite using a different methodology from our systematic review, these studies made similar recommendations, with a strong emphasis on the lack of utility testing and its impact on AI uptake in healthcare [166, 167, 168].

Conclusion

ES have been advancing in urology with demonstrated versatility and efficacy. However, they have suffered from a lack of formality in their development, testing and registration methodology, which has limited their uptake. Future research is recommended to identify criteria for successful functional domain applications and knowledge engineering, and to integrate system development with the registration requirements needed for future implementation in health care systems.