Artificial intelligence in disease diagnostics: A critical review and classification on the current state of research guiding future direction

The diagnosis of diseases is decisive for planning proper treatment and ensuring the well-being of patients. Human error hinders accurate diagnostics, as interpreting medical information is a complex and cognitively challenging task. The application of artificial intelligence (AI) can improve the level of diagnostic accuracy and efficiency. While the current literature has examined various approaches to diagnosing a wide range of diseases, an overview of the fields in which AI has been applied, including their performance, aiming to identify emergent digitalized healthcare services, has not yet been adequately realized in extant research. By conducting a critical review, we portray the AI landscape in diagnostics and provide a snapshot to guide future research. This paper extends academia by proposing a research agenda. Practitioners gain an understanding of the extent to which AI improves diagnostics and how healthcare benefits from it. However, several issues need to be addressed before the successful application of AI in disease diagnostics can be achieved.


Introduction
The application of artificial intelligence (AI) provides advantages pertaining to the diagnosis of diseases. The healthcare system is a dynamic and changing environment [1] and medical specialists continually face new challenges with changing responsibilities and frequent interruptions [2,3]. This variety regularly leads to the diagnosis of disease becoming a side issue for healthcare experts. In addition, the clinical interpretation of medical information is a cognitively challenging task [4]. This not only applies to experienced professionals but also to actors with different or little expertise such as young assistant doctors [4,5]. Medical specialists' available time is usually limited [2,3] and diseases might evolve and patient dynamics change over time [6,7], making diagnostics a highly complex process [8,9]. However, an accurate diagnostic process is of central relevance to ensure timely treatment and, thus, to achieve safe and effective patient care [10,11].
The importance of AI as a component of the diagnostic process has been steadily increasing since building systems became more practical [12,13]. There is ongoing enthusiasm for and hype about AI [14][15][16], and both researchers and practitioners focus equally on this technology from multiple perspectives [17][18][19][35]. There is no uniform definition of the term AI [20], but it is often considered "the ability of a machine to perform cognitive functions that we associate with human minds, such as perceiving, reasoning, learning, interacting with the environment, problem solving, decision-making, and even demonstrating creativity" [21]. AI is generally associated with human-like behavior [22] and covers a wide range of research areas, such as natural language processing or robotics [23]. However, current practical applications, including healthcare and disease diagnostics, are narrowed down to a specific task [24], and are being developed using machine learning [12,13]. Algorithms exploit medical data to generate predictions [22] and continuously learn and develop over time by constantly processing new and updated data [25]. Algorithms acquire information through different types of knowledge and input or over multiple years of experience [25]. Therefore, AI-empowered systems are able to process more knowledge than humans [22,23], possibly outperforming them for certain medical tasks.
The application of AI within the diagnostic process supporting medical specialists could be of great value for the healthcare sector and overall patient well-being. The integration of AI into existing technical infrastructure accelerates the identification of relevant medical data from multiple sources which are tailored to the needs of the patient and the treatment process [26][27][28][29]. Simultaneously, AI helps dissolve silo thinking, for example by enabling knowledge sharing across departmental boundaries [30], as information from all involved areas is taken into account. Furthermore, AI generates results based on a larger population rather than on subjective, personal experiences [31] and achieves consistent results when using identical medical data, as it does not depend on situations, emotions, or time of day [32].
Despite the potential shown by AI as a component of medical diagnosis, to date, no thorough analysis of approaches has been conducted, and a comprehensive conceptualization of algorithms that have been previously applied in the diagnostic process for various diseases remains lacking. Recent studies have examined the application of AI in healthcare in general [33][34][35] or in specific clinical domains [36]. For example, Jiang et al. [33] focused on classical AI algorithms such as support vector machines and neural networks within the entire healthcare sector. Another study by Rauschert et al. [36] dealt with AI in clinical epigenetics, whereby the authors investigated individual treatment characteristics of patients based on genetic and epigenetic profiles. In contrast, our research concentrates on the diagnostic process and the extent to which AI is being integrated into it. Specifically, we aim to provide a classification of the current state of AI in disease diagnostics and guide future research directions. Our study uncovers which algorithms are particularly suitable for which disease diagnosis and where other approaches may be more advisable. By pointing out explicit performance measurements, we explain the overall suitability of existing strategies. We argue that this is of great interest to researchers and practitioners since the relevance of AI within the diagnostic process will continue to increase. There is an urgent demand for academia to provide an overview of the AI-based approaches that are being applied to understand the intricacy of this ever-evolving research area. To address these pressing issues, our study is guided by the following research question: RQ: How is artificial intelligence currently applied in healthcare to facilitate the diagnostic process, and what are the pertinent future research directions?
We have followed the latest methodological guidelines for literature reviews [37,38] and conducted a critical review (CR) analyzing existing literature on a broad topic to reveal weaknesses, inconsistencies, problems, or discrepancies [39]. We initially reviewed the status quo based on the existing literature on AI in medical diagnostics, for which we focused on research articles that dealt with practical applications. This has guided us in portraying the current AI landscape in the diagnostic process, including areas that have not yet been adequately covered. By conducting a critical assessment, we are able to provide scholars with knowledge about future research directions and improvements [38].
This paper structures our understanding of AI algorithms that have been researched as components of medical diagnostics. Researchers will find the overview of application fields helpful when considering unexplored areas where the deployment of AI for healthcare services could be beneficial and future research might be necessary. Practitioners will be better able to understand the extent to which AI improves the diagnostic process and how healthcare specialists and patients, as well as the overall treatment processes, benefit from it. Society will comprehend how AI is likely to be used within diagnostics including its current deliberations. We believe this study will be valuable to drive related research and practical implementations as well as lead to a constructive debate on how AI applications can be further integrated into the diagnostic process to enhance overall patient outcomes.

Related work
The fundamental goal of the diagnosis of disease lies in determining whether a patient is affected by a disease or not [40]. Diagnosis can be seen as a process of classification or as a "pre-existing set of categories agreed upon by the medical profession to designate a specific condition" [41]. The entire diagnostic process, in general, is a sophisticated [8] and "patient-centered, collaborative activity that involves information gathering and clinical reasoning with the goal of determining a patient's health problem" [10]. The Committee on Diagnostic Error in Health Care developed a conceptual model to comprehend the diagnostic process (cf. Figure 1).
Initially, the patient experiences a health problem associated with the individual's symptoms, which causes the person to contact the healthcare system, where sufficient information is collected via reviewing the patient's clinical history and conducting an interview, performing a physical exam and diagnostic testing, and referral and consultation involving other medical experts. Information gathering, integration, and interpretation, as well as providing a working diagnosis (for example, a single or differential diagnosis), together represent a continuous process that can be repeated several times. The working diagnosis and an explanation of it are shared with the patient, and appropriate treatment is planned. Finally, this process results in an outcome for patients and the healthcare system, such as learning from errors or a timely diagnosis.
This highly complex process has, on the one hand, great potential for errors and, on the other hand, may vary according to the medical discipline. A diagnosis is usually based on the individual experiences of clinicians and may differ depending on the emotional and mental state of the clinician [42]. Medical specialists further have a broad spectrum of duties [2,3], and the time they can devote to determining a diagnosis is usually limited. Moreover, since several experts from multiple medical disciplines are needed for a diagnosis [43,44], the process might be extended even though timeliness is critical when considering further treatment plans [10].
The unique challenges of modern medical practice and the wide variety of diagnostic strategies can create difficulties for the entire healthcare system, and thus, more systems are relying on information technology (IT) [45]. AI is already utilized to assist medical experts and improve the diagnosis of diseases, for example, in the early diagnosis of ectopic pregnancies and helping gynecologists with their decisions about initial treatment [46]. This progress is due to improvements in technical capabilities such as processor speed and storage capacity, while costs have dropped [47]. Furthermore, the amount of medical data is continuously growing, and systems are now capable of identifying, extracting, and processing information from different sources more quickly than previously [48,49]. In addition, the technology and algorithms needed to implement systems are now widely accessible and can be easily applied [50] and are commonly classified into supervised, unsupervised, and deep learning [36,51].
Supervised learning describes algorithms that learn associations based on existing samples or training data [52,53]. The labels of a dataset are known, for example, images of fractures or ruptures that have been classified by medical specialists. This information is used to train algorithms that generate predictions for unused data [22,54]. Supervised learning depends on a given user input and is thus sensitive to data quality [22]. Poorly or incorrectly labeled data lead to faulty predictions, and trained algorithms might even be biased. Supervised learning is one of the most widely used AI approaches, as it provides robust classifications [36]. Common examples include logistic regression or neural networks. Supervised learning approaches have been found to aid in the diagnosis of dementia [55] and cancer [56].
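To illustrate the supervised paradigm described above, the following sketch trains a logistic regression classifier on a tiny labeled dataset. The data (a single normalized biomarker value per patient) and all parameters are invented for demonstration and are not drawn from the reviewed studies.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression weights via stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log-loss w.r.t. the score
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    """Classify a new, unseen sample (1 = disease, 0 = no disease)."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

# Hypothetical labeled training data: one normalized biomarker per patient,
# with labels assigned by medical specialists.
X_train = [[0.1], [0.2], [0.8], [0.9]]
y_train = [0, 0, 1, 1]
w, b = train_logistic(X_train, y_train)
```

Prediction on unseen inputs then follows the learned decision boundary, e.g. `predict(w, b, [0.15])` for a low biomarker value; as noted above, the quality of such predictions hinges entirely on correctly labeled training data.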
Unsupervised learning covers self-organizing algorithms that learn associations without existing samples or training data [57,58]. It is well suited for identifying correlations within a dataset but is unable to determine their statistical relevance [59]. Moreover, unsupervised learning might also identify irrelevant clusters if these are grounded in the data. This approach does not necessarily depend on user input but demands verification of the plausibility and salience of the identified clusters [22,60]. Common examples include clustering and dimensionality reduction algorithms. Unsupervised learning is used, for example, for hepatitis disease diagnostics [4].
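To illustrate the unsupervised paradigm, the sketch below groups unlabeled one-dimensional measurements with a minimal two-cluster k-means. The values are invented for demonstration, and, as noted above, any clusters found in practice would still need plausibility checks by domain experts.

```python
def two_means(values, iters=25):
    """Minimal 1-D k-means with k = 2: no labels are used at any point."""
    c = [min(values), max(values)]  # initial centroid guesses
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        # Assignment step: each value joins its nearest centroid.
        for v in values:
            groups[0 if abs(v - c[0]) <= abs(v - c[1]) else 1].append(v)
        # Update step: move each centroid to the mean of its group.
        c = [sum(g) / len(g) if g else c[i] for i, g in enumerate(groups)]
    return c, groups

# Hypothetical unlabeled lab measurements with two apparent regimes.
centroids, clusters = two_means([1.0, 1.2, 0.8, 10.0, 10.5, 9.5])
```

On this toy data the algorithm recovers two groups of three values each, with centroids near 1.0 and 10.0, without ever being told which measurements belong together.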
Finally, deep learning algorithms learn correlations among data via evolutionary tests that continually adjust predictions according to the given data [61,62]. Deep learning might be considered an individual AI approach; however, it must be viewed from various angles due to its capability of combining supervised and unsupervised approaches [36]. It is particularly suitable for handling large datasets with multiple dimensions and input sources [61,63], but the outcome is not explainable due to its sophisticated structure and thus represents a black box for users, potentially posing a high risk to patients' well-being [64,65]. Common examples include recurrent neural networks and convolutional neural networks. Examples of deep learning in diagnostics are the classification of dermatological diseases [66] and atrial fibrillation detection [67].

The development of supervised, unsupervised, and deep learning algorithms presupposes that the available data can be split into training and test sets [33,36]. As part of training, the algorithm is optimized by tweaking different parameters and attributes of the data. Within testing, the performance of the algorithm is validated. Both sets usually use a subset of the same dataset. Their separation does not follow any specific rules, but the training set is normally larger than the test set. An increasingly popular alternative to purposefully separating the data is the use of cross-validation, where the data is randomly split into multiple training and test sets to approximate the external validity of the algorithm [68]. Finally, the algorithm and the generated model are validated using unrelated, unseen data, for example, genuine medical data.
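The splitting logic described above can be sketched as follows; the fold count and random seed are arbitrary illustrative choices, not values prescribed by the reviewed studies.

```python
import random

def kfold_splits(n_samples, k=5, seed=42):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random partition of the dataset
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Each of the k rounds trains on k-1 folds and tests on the held-out fold,
# approximating how the model would perform on unrelated, unseen data.
splits = list(kfold_splits(100, k=5))
```

Every sample appears in exactly one test fold, so the k performance estimates together cover the whole dataset while keeping training and test data disjoint in each round.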
Various performance measurements are applied to evaluate the efficiency of a trained model. The most common measurements found in scientific publications are accuracy, sensitivity, and specificity [69,70]. Accuracy estimates the correct classification out of all classifications on a range from 0-1 (the equivalent of 0-100%). Sensitivity explains how many patients with a disease have been correctly identified with this disease (true positive rate), on a range from 0-1 (0-100%). Specificity, in contrast to sensitivity, determines how many patients without a disease have been correctly identified without this disease (true negative rate), on a range from 0-1 (0-100%) [36]. Thus, higher values for accuracy, sensitivity, and specificity indicate a well-trained model that provides the most accurate results.
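These three measurements follow directly from the confusion matrix; the sketch below computes them for a hypothetical set of predictions (the labels are invented purely for illustration).

```python
def diagnostic_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity from binary labels (1 = disease)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)  # correct classifications out of all
    sensitivity = tp / (tp + fn)        # true positive rate
    specificity = tn / (tn + fp)        # true negative rate
    return accuracy, sensitivity, specificity

# Hypothetical ground truth vs. model predictions for eight patients.
acc, sen, spe = diagnostic_metrics([1, 1, 1, 0, 0, 0, 1, 0],
                                   [1, 1, 0, 0, 0, 1, 1, 0])
```

For this toy example all three measures come out to 0.75 (75%): one diseased patient is missed and one healthy patient is falsely flagged, which lowers sensitivity and specificity equally.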

Research approach
The objective of this research lies in identifying in which areas of the diagnostic process AI has already been applied and how these approaches are to be evaluated. We performed a CR to identify relevant literature and analyze it according to the AI algorithms, their medical applications, dataset characteristics, and performance measurements. Since we were interested in providing a holistic picture, we decided to start our literature search within the information systems (IS) discipline. We argue that IS, as an interdisciplinary community addressing dynamic and evolving sociotechnical challenges, serves as a valid starting point for guiding medical specialists. Within the individual steps of the research procedure, we expanded our search to other scientific domains and outlets. For example, we identified publications from healthcare (e.g., Annals of Emergency Medicine) and biomedical engineering (e.g., Computers in Biology and Medicine).
As CRs are frequently performed in an unsystematic manner [39] but benefit from an informative explanation of how the literature was retrieved [37], we adopted a systematic approach in our research procedure. This ensured that we would avoid a subjective aggregation of existing literature [39] and instead make a significant contribution by searching for the most relevant literature using keywords in scientific databases [71,72]. We adopted a descriptive review focus, since it generates an interpretable pattern from the existing literature [73], leading to a depiction of the current state of our research domain [74]. The research procedure, as illustrated in Fig. 2, considers articles up until December 2020.
The literature search was conducted with the help of litbaskets.io, a novel IT artifact specifically designed to assist researchers in retrieving relevant literature from credible scientific sources [75]. Unlike in other areas, for example in medicine or psychology, there is no uniform database covering the broad spectrum of outlets for conducting comprehensive literature searches in the IS discipline. Litbaskets assists scholars in the concise and precise selection of relevant publications. Technically, a search string is created which is used by Scopus's advanced search, making it possible to search across indexed scientific sources [75]. Thus, litbaskets does not collect articles itself but uses the ISSN numbers of selected outlets to retrieve articles according to the search term. We deliberately chose 154 essential IS journals because we wanted to focus on substantial and high-quality articles. The literature search considered all peer-reviewed articles; less relevant sources, such as editorials, were excluded. We carried out a full text and metadata search; we deliberately did not limit it to metadata only, as it cannot be guaranteed that the search term is always contained in the metadata, which could result in overlooking relevant publications. We used the following query for our full text search: (AI or "artificial intelligence") and (diagnostic or diagnostics). We purposely chose a rather broad term, pursuing a central concept of CRs: analyzing a research topic using a wider scope compared to other procedures [39], aiming to reveal "weaknesses, contradictions, controversies, or inconsistencies" [38]. We used the terms AI and artificial intelligence as well as the singular and plural spellings of diagnostic to cover the focus of our research. We did not use, for example, names of specific machine learning algorithms to avoid needlessly limiting our search.
To concentrate on retrieving literature on AI in disease diagnostics without narrowing down our initial search, we further did not include terms related to diagnostic, such as diagnosis or diagnose. This ensured a thorough analysis of existing approaches, as a broad term is crucial for providing a comprehensive overview yielding an agenda that guides future directions.
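Mechanically, the litbaskets approach described above amounts to assembling one Scopus advanced-search string from the selected outlets' ISSNs. The sketch below is a hypothetical illustration, not litbaskets' actual implementation; the `ISSN` and `ALL` field codes follow Scopus's advanced-search syntax, and the two ISSNs shown are merely example outlets.

```python
def build_scopus_query(issns, term):
    """Assemble a Scopus advanced-search string: restrict hits to the
    selected outlets via their ISSNs, then match the search term across
    all indexed fields (full text and metadata)."""
    source_filter = " OR ".join(f"ISSN({issn})" for issn in issns)
    return f"({source_filter}) AND ALL({term})"

# Two example ISSNs (Expert Systems with Applications, MIS Quarterly).
query = build_scopus_query(
    ["0957-4174", "0276-7783"],
    '(AI OR "artificial intelligence") AND (diagnostic OR diagnostics)',
)
```

With 154 journals, the same helper would simply receive 154 ISSNs, producing a single long disjunction of source filters combined with the broad search term.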
After retrieving the literature, we carefully read the title, abstract, and keywords of each publication to determine its relevance to our research question. We focused on papers that specifically considered and applied AI in the context of diagnostics. Our guiding questions were: Which AI methods were applied to improve the diagnostic process? What algorithm is used for which diseases, and how can medical specialists be assisted in the diagnosis of diseases? How have algorithms been developed for disease diagnostics? How accurate are the results of the individual applications? Publications that compared algorithms under theoretical conditions or those that did not disclose the applied algorithm were not considered in our research.
Since a basic search is not able to provide a comprehensive review, and we were highly interested in retrieving relevant literature outside the IS discipline, we further conducted a backward search to identify additional literature. We collected all references listed in the bibliographies of all the papers from the initial search, including only references to other scientific publications; sources such as web pages, panel discussions, or business reports did not fit our research goals and were excluded. The relevance of the papers was determined via the same approach we used within the initial search: we read the title, abstract, and keywords, which was followed by categorization according to theoretical foundations. The final step included a forward search to further identify relevant publications, whereby we acknowledged all papers that were retrieved in the initial and backward searches. We were interested in articles that had been cited by other researchers after their initial publication. Once again, we read the title, abstract, and keywords and performed the categorization process as outlined above. We deliberately decided against adding keywords for filtering or excluding, for example, theory-driven research articles, to prevent overlooking relevant literature. Even though this manual process was very time consuming, it simultaneously ensured gathering the largest possible number of relevant publications.
Following the recommendation of Bem [76] for a conceptual structuring of the research topic, the categories of the examined disease and applied learning type served as a conceptual pre-structure whereby the literature was roughly categorized. Afterwards, we examined and compared the retrieved literature to identify correlations and similarities. Prior research was then grouped based on its theoretical foundations and finally reviewed concerning quantitative criteria. We initially clustered the publications according to the learning types, using supervised and deep learning, as these approaches provide more clinical results and are thus more likely to be found in healthcare [33]. A cross-check further revealed only one paper that dealt with unsupervised learning. Following Jiang et al. [33], we then used the most common machine learning algorithms applied to develop AI applications [12,13,22] as classifications within supervised and deep learning. This resulted in 12 clusters: neural network, support vector machine, nearest neighbor, random forest, decision tree, logistic regression, naive Bayes, discriminant analysis, convolutional neural network, deep neural network, recurrent neural network, and others. We further clustered the retrieved articles according to their examined disease. Since this disease-level view was rather granular, we assigned each disease to the affected organic system (i.e., cancer, cardiovascular, dermatological, gastrointestinal, infectious, metabolic, neurological/psychiatric, pediatric, pulmonary, and urogenital). The resulting classification according to the algorithm combined with the organic system helped us to interpret the results in a more holistic way. Finally, we analyzed the dataset characteristics and performance measurements of the algorithms. The datasets were described by their origin, sample size, number of features (e.g., patient characteristics, such as age or smoking status), and training and testing samples.
The performance was assessed using accuracy, sensitivity, and specificity. These quality criteria aided us in determining the value of algorithms applied for disease diagnostics as well as in critically examining whether developments and applications seemed reasonable. Table 1 exemplifies how we performed the categorization by providing example articles and their classification. Please note that one paper might have dealt with multiple algorithms. Tables 2 and 3 explain the organic systems and algorithms used for clustering.

Results and analysis
The execution of the CR resulted in 126 relevant articles, of which 29 were retrieved via the initial search, 17 via the backward search, and, finally, 80 via the forward search. We analyzed the articles according to their distribution by year, publication outlet, distribution by category, and, finally, dataset characteristics and performance measurements. We thereby obtained an overview of the AI-based approaches examined in the literature, including their suitability for the diagnostic process, and critically evaluated these results.
There is no prototype or implementation of a developed algorithm or model that is actually used in healthcare for the purpose of diagnosing diseases or assisting within the diagnostic process. A limited number of studies presented user interfaces as spin-off products that theoretically provide a basis for deployment in disease diagnostics. For example, Ogah et al. [78] developed a knowledge-based system including a user interface for diagnosing hepatitis B via a neural network. However, the study focused on the development of the algorithm and underlying database rather than providing a suitable graphical interface for users. Furthermore, most studies have employed textual medical data to implement suitable algorithms for disease diagnostics. Mishra et al. [66] curated a database with 4,700 images of nine common dermatological diseases, such as acne, erythema, or wheal, and used deep learning for correct classification. Another example used captured transvaginal ultrasounds of pregnant women to detect ectopic pregnancies [46]. Nevertheless, these studies are rather the exception; most approaches have developed a specific algorithm to diagnose a particular disease. A limited fraction of research combines algorithms to achieve better results. For example, decision trees and case-based reasoning were integrated into an intelligent model for liver disease diagnosis, indicating considerable accuracy compared with single-method concepts [79]. Likewise, a small number of studies implemented and compared multiple algorithms and considered which one best fit the diagnosis of a disease. One study performed predictive modeling via multiple algorithms, that is, logistic regression, random forest, decision tree, and support vector machine, for the early detection of Parkinson's disease [80]. Lu et al. [56] compared decision tree, logistic regression, nearest neighbor, neural network, and support vector machine approaches to assist in cervical cancer diagnosis.
Although there are nascent approaches to compare algorithms, there is no scientific publication that has yet investigated whether one approach is suitable for the diagnosis of several diseases.

Distribution of articles by year
No articles were published before 1990, but articles related to AI in diagnostics have increased substantially over time (cf. Figure 3). Since 2013, the number of publications has risen considerably, which is in accord with technological improvements of AI. The increasing number of studies is a product of the enhanced and ever-growing technical capabilities and the quantity of medical data [47,50]. The constant growth of publications reflects not only the increased demand for AI within the disease diagnostics but also represents the salience and legitimacy of this research area.

Distribution of articles by outlet
From a total of 126 articles, 105 (83.3%) were published in journals, and only 21 (16.7%) were conference publications. Most articles (52, 41.3%) came from 12 outlets and were published in practice-oriented journals. This is understandable since we explicitly searched for articles that examined AI within the diagnostic process. The highest number of journal articles was published in Expert Systems with Applications (15, 11.9%). Furthermore, a large portion of publications came from outside the IS discipline; this was acceptable and expected, as our focus was clearly on applications in healthcare. Thus, most practical applications of AI in disease diagnostics were to be found in medical outlets. Table 4 outlines the distribution of articles by outlets with more than one publication.

Distribution of articles by category
Among the organic systems, cardiovascular disorders clearly stood out as the most heavily researched area, with 34 articles. Table 5 depicts the number of articles in each category. To validate whether a research domain has received much attention and whether certain areas are highly correlated, we compared the organic systems and algorithms the papers have dealt with using a research assignment matrix (cf. Table 2). We matched the corresponding organic systems with each algorithm. The illustration aided us in examining and allocating the results of the large number of relevant publications. The matrix provides an overview of the status quo; areas with less research do not indicate that further studies are inevitably required there. However, the matrix highlights that transferring previously gained knowledge of the application of AI within diagnostics to other research areas seems possible and advisable. It appears that most of the algorithms were already being used in the diagnosis of neurological/psychiatric diseases (10), followed by cardiovascular and gastrointestinal disorders (7), and urogenital diseases (7). Dermatological, infectious, metabolic, and pulmonary diseases (3) were the least examined. In terms of the algorithms, neural networks (8) and support vector machines (8) were the most researched. Both methods have been applied to the diagnosis of diseases for nearly every organic system.

Table 2 Explanation of organic systems used for clustering

Cancer: Cancer begins with an abnormal cell that develops over time into a mass of cells and can then metastasize, thus spreading to other locations in the body [130]. Depending on the type of cancer, the treatment is frequently complicated, and an illness might lead to death. There are more than 200 different types of cancer, all of which are treated differently [130]. Well-known examples are breast [131], cervical [132], or liver [133] cancer.

Cardiovascular: Cardiovascular disorders include diseases of the heart or blood vessels as well as vascular diseases of the brain [134]. They are known as the leading cause of death and disability in the world, with more than 17 million deaths per year [135]. Specific examples are acute myocardial infarction [116], coronary artery disease [136], or atrial fibrillation [67].

Dermatological: Dermatological examinations deal with normal and abnormal skin and include its associated structures, e.g., hair, nails, and oral and genital mucous membranes [137]. Skin diseases are very common: almost one-third of all humans are affected by them in the course of their lives [137]. Examples are erythemato-squamous disease [138] or psoriasis [96].

Dataset characteristics and performance measures
The examined literature varied considerably in its depth of detail and presentation of the data used as well as the results achieved. Table 6 outlines example dataset characteristics and performance measures (ACC = accuracy, SEN = sensitivity, SPE = specificity, SP = sample, FR = features, TR = training, TE = testing, ? = no information, cv = cross-validation).
We distinguished the origin of datasets as self-retrieved information (37), using an existing database (51), medical data grounded in other studies (7), and not providing any details (31). The results showed that most of the studies did not reveal detailed information on the data origin or on how the data were collected, including circumstances or contexts. However, an in-depth presentation and explanation of the origin of the medical data used to develop algorithms for disease diagnostics is a vital component for interpreting outcomes and verifying whether findings are adaptable and generalizable. A positive example was presented by [46], who used clinical data based on a long-term study conducted from November 2010 to September 2015 at the Department of Obstetrics and Gynecology of the University Hospital "Virgen de la Arrixaca" in the Murcia region of Spain. A total of […] consequences. However, one might argue that researchers are more familiar with self-retrieved data, leading to better results in terms of the algorithms' efficiencies. Existing data might be used with the objective of testing a developed algorithm without really aiming to diagnose a disease, instead merely fulfilling a purpose.

Organic system and description:

Gastrointestinal: Gastrointestinal disorders involve diseases of the digestive system, the most important connection between absorbed nutrients and the human body [139], and often occur as a result of abnormal behavior of the gastrointestinal tract [140]. These might be chronic disorders, such as chronic kidney disease [106], but also include lesser-known illnesses such as cirrhosis [141] or celiac disease [121].

Infectious: Infectious diseases, commonly known as transmissible diseases, describe clinically evident illnesses in which an organism is capable of entering, surviving, and multiplying in a human host [142]. Typical transmission paths include physical contact, contaminated food, or body fluids [142]. The different forms of hepatitis are among the most commonly referred to infectious diseases [4,102]. Tuberculosis is also classified as an infectious disease [143].

Metabolic: Metabolic disorders might occur when the normal metabolic processes of the human body are altered by abnormal chemical reactions [144]. Disorders of the metabolism are accompanied by dynamic changes but are usually well treatable [144]. Different types of diabetes [145] fall into this category, which also includes diseases such as osteoporosis [127] or thyroid disease [146].

Neurological/psychiatric: The neurological/psychiatric (neuropsychiatry) category comprises organically conditioned cognitive and mental disorders and thus overlaps with the medical research fields of psychiatry, neurology, and psychology [147]. The classic neuropsychiatric diseases, with symptoms in both the neurological and the psychiatric field, are Parkinson's disease [148], dementia [55], and autism [149].

Pediatric: Pediatric diseases are disorders that explicitly occur during childhood. Research has shown that the treatment of childhood diseases differs significantly from that of diseases contracted by adults [150], justifying physicians caring for children holistically. Common examples of such diseases are neonatal sepsis [109] or abdominal pain [151].

Pulmonary: Pulmonary diseases are related to problems with the lungs [152]. Patients often experience breathing problems, shortness of breath, or coughing [153]. In addition to chronic obstructive pulmonary disease [153], asthma [112] is also a common lung disease.

Urogenital: Urogenital disorders are problems that directly affect the urinary and genital tracts [150]; they are further differentiated according to whether a person is affected only temporarily or for life. Urogenital disorders comprise, for example, urinary tract infections [154], but ectopic pregnancy [46] is included as well.

The quantitative details of the datasets have also shown considerable discrepancies among the studies. Other than a few articles (9), the majority outlined the exact sample size of the dataset; however, the numbers vary considerably, ranging from 9 to 212,554. Studies have rarely used a large dataset (N > 1,000), and we found only a subset of nine publications that considered large samples. For example, [67] analyzed atrial fibrillation using a convolutional neural network based on 150,060 samples. In contrast, [89] examined 53 patients to diagnose osteoporosis. Small sample sizes hinder generalization to a larger population; thus, results must be interpreted with caution.

Algorithm and description:

Decision Tree: A decision tree always consists of a root node, several inner nodes, and at least two leaves [155]. Each node represents a logical rule and each leaf an answer to the problem.

Logistic Regression: Logistic regression algorithms aim at finding relationships between variables, which are refined over multiple iterations to predict an output value (or multiple values) based on given input features [156]. The output might be a finite number of states.

Naive Bayes: Naive Bayes is based on the Bayes theorem and represents a simple graphical model (a directed acyclic graph) for determining relationships among various features [155]. The goal is to present the most compact probability distribution of the involved variables by using known conditional independences.

Nearest Neighbor: Nearest neighbor is used for assessing the probability of classification (class membership) or regression (property value) [157].

Recurrent Neural Network: A recurrent neural network is, like a convolutional neural network, useful for data with multiple dimensions [61], but it contains feedback loops, i.e., the output of a neuron serves as input for another neuron or even for itself.

Others: There are other, less-known approaches, such as dimensionality reduction or regularization algorithms. However, in the context of this study, these play an insignificant role, which is why no detailed description is provided at this point.
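Several of the classifier families described above have standard implementations. As a minimal, purely illustrative sketch (a synthetic dataset stands in for patient data; it is not drawn from any reviewed study), they can be trained and compared with scikit-learn:

```python
# Illustrative only: three classical classifier families on synthetic
# "diagnostic" data (binary label: disease present / absent).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# 500 synthetic cases with 10 tabular features.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive Bayes": GaussianNB(),
    "nearest neighbor": KNeighborsClassifier(n_neighbors=5),
}
# Fit each model on the training split and report held-out accuracy.
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.2f}")
```

The relative ranking of such models depends heavily on the dataset, which is precisely why the review's call for detailed dataset reporting matters.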
There is no "one size fits all" answer to the requirement for a minimum sample size. However, larger samples are favorable for achieving better results [90], especially since small datasets frequently fail to detect certain patterns [91][92][93]. Furthermore, sample size can produce bias even when more than 1,000 records are used [94]. The identified training and testing samples first showed that the data were usually split according to certain patterns (e.g., 60/40, 70/30, 80/20). However, it is remarkable that only a small number of studies (26) used cross-validation to randomly split the data in order to approximate the external validity of the algorithm and thus generate better outcomes [68]. Furthermore, the algorithms were typically trained and tested on a subset of the dataset and afterwards validated using the same entire dataset again. Although this is a common approach, it seems arbitrary, as the results are analyzed in isolation without the inclusion of separate and previously unknown medical data.
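The advantage of cross-validation over a single fixed split can be sketched as follows (a generic scikit-learn example on synthetic data; dataset and parameters are illustrative assumptions, not taken from the reviewed studies):

```python
# k-fold cross-validation: every sample serves as test data exactly once,
# yielding a spread of scores instead of one possibly lucky 70/30 split.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, random_state=1)
model = LogisticRegression(max_iter=1000)

cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(model, X, y, cv=cv)
print(f"accuracy per fold: {scores.round(2)}")
print(f"mean +/- std: {scores.mean():.2f} +/- {scores.std():.2f}")
```

Reporting the mean and spread across folds gives a more honest estimate than re-validating on the very data the model was trained on.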
Similar to the characteristics of the datasets, the reporting of the performance measures was rather patchy. Only sixty-six of the 126 studies reported detailed information on all three measures. The remaining publications were missing at least one value, whereby accuracy was usually the measure used to assess the efficiency of the algorithm. However, reporting accuracy as a single performance measurement might be misleading [95].
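Why accuracy alone can mislead is easy to show with a small numeric example (the class balance here is hypothetical): on an imbalanced cohort, a degenerate model that never flags disease still scores high accuracy.

```python
# A cohort of 95 healthy and 5 diseased cases, and a "classifier" that
# always predicts healthy. Accuracy looks strong; sensitivity is zero.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 95 + [1] * 5)   # 0 = healthy, 1 = diseased
y_pred = np.zeros(100, dtype=int)       # always predicts "healthy"

accuracy = accuracy_score(y_true, y_pred)                 # 0.95
sensitivity = recall_score(y_true, y_pred)                # 0.0, misses all patients
specificity = recall_score(y_true, y_pred, pos_label=0)   # 1.0
print(accuracy, sensitivity, specificity)
```

This is why reporting sensitivity and specificity alongside accuracy is essential for diagnostic algorithms.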
The results of the studies were mostly quite promising; the overall accuracy ranged from 61 […]. However, evaluating algorithms in disease diagnostics without detailed information about the dataset used and the efficiencies achieved prevents the generalization of results and their transfer to a different context. Comprehensive information is crucial for understanding a study's justification and its contribution to theory and practice. Frequently, too much information is missing for an adequate evaluation to be undertaken.

Discussion and future directions
We critically reviewed the identified publications according to both their strengths and weaknesses. Following the recommendations for CRs [38], we set out to determine directions for future research. We grouped our findings under suitable headings that deal with identical or similar issues. Tables 7 and 8 summarize exemplary future research questions according to the four identified areas.

Advancements and explicability
Recent studies have primarily been concerned with the development of a specific algorithm for the detection of a particular disease [87,124]. Moreover, some algorithms have been researched more than others. Future research should examine possible advancements by combining the various existing algorithms to achieve better results compared with the isolated use of a single algorithm [97,98]. We strongly recommend more research in the field of deep learning for disease diagnostics so that large amounts of medical data can be processed faster [61,62] and satisfying results can be reached more reliably [67,81]. However, an essential technical restriction of the more complex but performant deep learning approaches lies in the fact that the results of AI remain a black box to humans [125]. The outcomes are not always comprehensible, which makes it, on the one hand, nearly impossible to learn from the AI's decisions and, on the other hand, challenging to build trust in the system itself. Future research should therefore focus on improving the understandability and explainability of AI-derived conclusions [99]. A transparent prediction-making process leads to a trustworthy relationship between the AI and medical experts [100]. Overall, we offer the following research proposition (RP1): […]
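One concrete starting point for the explainability research called for here is post-hoc model inspection, for instance permutation feature importance: shuffle each input feature in turn and measure how much held-out performance drops. This is a generic sketch on synthetic data, not a method attributed to any reviewed study:

```python
# Permutation importance: a model-agnostic, post-hoc explainability probe.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

model = RandomForestClassifier(random_state=2).fit(X_train, y_train)

# Shuffle each feature and record the mean drop in test accuracy:
# a large drop means the model relies on that feature.
result = permutation_importance(model, X_test, y_test, n_repeats=10,
                                random_state=2)
for i, imp in enumerate(result.importances_mean):
    print(f"feature {i}: importance {imp:.3f}")
```

Such per-feature attributions do not fully open the black box, but they give clinicians a first handle on which inputs drive a prediction.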

Corroboration and portability
There has been a recent wave of AI algorithms that aim to assist in the diagnostic process and that generally consider a single dataset based on textual input. Numerous studies have indicated satisfying results [126,127]. However, there is a certain risk that results are not applicable to other domains and deliver proper results only for a specific application. Future research needs to corroborate these findings in diverse patient populations [116]. This can be achieved by using heterogeneous and larger datasets (i.e., with N > 1,000 samples) [55,89,96,[101][102][103][104][105][106][107][108][109] covering a range of formats, such as X-ray images or ultrasounds [56,110], which are currently almost neglected. Furthermore, larger datasets should commonly be split using cross-validation to approximate external validity and thus generate better outcomes [68]. Simultaneously, findings need to be transferred to other types of diseases, for example, different types of cancer [67,87,[111][112][113]118], but also to other clinical applications [100,114,115,117]. In addition, one must ask why scientific evidence is not yet widely integrated into disease diagnostics, for instance, in hospitals or other clinical environments. The question remains whether AI approaches persist in real-world scenarios [118,128]. Portability is thus a crucial factor in the future of AI in disease diagnostics [116,119]. Hence, we offer the following proposition: RP2: We propose more research on how AI in diagnostics can be adapted to other clinical environments and confirmed using larger datasets with enhanced validity. Additionally, we suggest the examination of AI for diagnosing disease in real-world scenarios to confirm its practical suitability.
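The corroboration concern can be made concrete with external validation: train on one cohort and evaluate on a second, independently generated cohort with a different class balance. The sketch below uses two synthetic cohorts as a crude stand-in for a new clinical environment (all data and parameters are illustrative assumptions), and performance typically does not transfer:

```python
# External validation sketch: a model fitted on a "development" cohort is
# scored on an independently generated "external" cohort.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_dev, y_dev = make_classification(n_samples=1500, n_features=10,
                                   weights=[0.7, 0.3], random_state=3)
X_ext, y_ext = make_classification(n_samples=500, n_features=10,
                                   weights=[0.9, 0.1], random_state=4)

model = LogisticRegression(max_iter=1000).fit(X_dev, y_dev)
internal = model.score(X_dev, y_dev)   # optimistic, same-cohort accuracy
external = model.score(X_ext, y_ext)   # accuracy on the unseen cohort
print(f"internal accuracy: {internal:.2f}, external accuracy: {external:.2f}")
```

Reporting both numbers, rather than internal performance alone, is exactly the kind of evidence RP2 asks for before claiming portability.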

Integration and collaboration
The diagnosis of different diseases is a strongly subjective process and a cognitively challenging task that depends on the clinician's individual experience and differs based on their emotions and mental state [4,42]. With the application of AI to the diagnostic process, medical experts are assisted by AI, possibly leading to superior results. However, current research on AI in disease diagnostics has dealt exclusively with technical implementations rather than being concerned with how AI might be integrated into existing technical infrastructure. Recent AI developments may yield sufficient results for diagnosing diseases; however, it is still unknown how the data will be presented to medical practitioners. Diagnostics still presupposes collaboration between humans and AI [129]. This requires intensified research on integration practices, especially on the development of user-friendly interfaces for multiple devices [66,79,83,118,121]. Researchers and healthcare practitioners should develop AI collaboratively to reach a better outcome for patients [100]. Scientific endeavors could go even further by developing a system that assists in the entire diagnostic process instead of focusing only on diagnosing a particular disease [122]. Furthermore, we argued earlier that collaboration between humans and AI can yield superior results. However, due to a lack of practical examples, future research needs to take a closer look at collaborative aspects when humans partner with AI in the diagnostic process. It may be that virtual human-AI teams outperform humans working in isolation [123]. This brings us to our final proposition: RP3: We propose more research on how AI can be integrated into existing technical infrastructure for assisting within the diagnostic process using suitable interfaces running on multiple devices. Moreover, scholars need to examine whether virtual teams consisting of humans and AI outperform medical teams or single expert efforts when diagnosing diseases.

Conclusion and limitations
In this article, we have illustrated the application of AI within diagnostics in current academic research. We presented our CR, which classified the retrieved literature according to organic systems, algorithms, dataset characteristics, and performance measurements. These results are useful for practitioners and healthcare researchers. The main theoretical contribution of this research is the proposal of a research agenda including exemplary research questions. We thereby seek to guide researchers' efforts and encourage future research in the field of AI as part of medical diagnostics. Furthermore, illustrating the intensity of studies, highly correlated areas, and an overview of unexplored research is helpful for the future deployment of AI for diagnosing diseases. On a practical level, practitioners understand the extent to which AI improves the diagnostic process and how the overall healthcare system benefits from it. Medical professionals understand how AI can be applied to diagnosing diseases, which could result in suitable suggestions for further developing AI-based approaches. In addition, healthcare experts comprehend which challenges still need to be tackled before disease is diagnosed in collaboration with AI. In terms of implications for society, readers realize that AI is likely to be used in healthcare to diagnose diseases or at least assist during the process. Nevertheless, the application of AI as a component of the diagnostic process provides opportunities for innovative digital health and is simultaneously able to ensure enhanced patient outcomes.
This research is not free of limitations. First, it should be noted that not all existing AI-based algorithms were used in our classification. We classified a few approaches as "other," which may indicate that these algorithms have been applied either rarely or probably not at all for diagnostic purposes. In addition, it could indicate that these algorithms are simply not suitable for application within the process of medical diagnosis. Furthermore, we looked only at publications that have dealt exclusively with the technical application of algorithms. We have therefore limited our research and may have missed relevant literature dealing with AI in the diagnostic process from which we could have acquired additional findings. Moreover, with respect to future research directions, researchers might consider looking at the entire course of the diagnostic process and whether and how AI can be used in ways other than the diagnosis of disease (Tables 9 and 10).

[194] Infectious: Neural network, Turkish Journal of Engineering
[195] Cancer: Neural network, Information Sciences
[196] Cancer: Neural network, Journal of Mechanics in Medicine and Biology
[197] Cardiovascular: Neural network, International Conference on Neural Networks
[198] Gastrointestinal: Neural network, Journal of Medical Systems
[199] Infectious: Neural network, International Joint Conference on Neural Networks

Author contributions Each author contributed equally.
Funding Open Access funding enabled and organized by Projekt DEAL. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.