1 Introduction

The application of artificial intelligence (AI) provides advantages pertaining to the diagnosis of diseases. The healthcare system is a dynamic and changing environment [1], and medical specialists continually face new challenges with changing responsibilities and frequent interruptions [2, 3]. This variety regularly leads to the diagnosis of disease becoming a side issue for healthcare experts. In addition, the clinical interpretation of medical information is a cognitively challenging task [4]. This applies not only to experienced professionals but also to actors with different or little expertise, such as young assistant doctors [4, 5]. Medical specialists’ available time is usually limited [2, 3], and diseases might evolve and patient dynamics change over time [6, 7], making diagnostics a highly complex process [8, 9]. However, an accurate diagnostic process is of central relevance to ensure timely treatment and, thus, to achieve safe and effective patient care [10, 11].

The importance of AI as a component of the diagnostic process has been steadily increasing since building such systems became more practical [12, 13]. There is ongoing enthusiasm for and hype about AI [14,15,16], and researchers and practitioners alike focus on this technology from multiple perspectives [17,18,19, 35]. There is no uniform definition for the term AI [20], but it is commonly described as “the ability of a machine to perform cognitive functions that we associate with human minds, such as perceiving, reasoning, learning, interacting with the environment, problem solving, decision-making, and even demonstrating creativity” [21]. AI is generally associated with human-like behavior [22] and covers a wide range of research areas, such as natural language processing or robotics [23]. However, current practical applications, including those in healthcare and disease diagnostics, are narrowed down to a specific task [24] and are typically developed using machine learning [12, 13]. Algorithms exploit medical data to generate predictions [22] and continuously learn and develop over time by constantly processing new and updated data [25]. Algorithms acquire information through different types of knowledge and input or across multiple years of accumulated experience [25]. Therefore, AI-empowered systems are able to process more knowledge than humans [23], possibly outperforming them in certain medical tasks.

The application of AI within the diagnostic process to support medical specialists could be of great value for the healthcare sector and for overall patient well-being. The integration of AI into existing technical infrastructure accelerates the identification of relevant medical data from multiple sources, tailored to the needs of the patient and the treatment process [26,27,28,29]. Simultaneously, AI breaks down silo thinking, for example by sharing knowledge across departmental boundaries [30], as information from all involved areas is taken into account. Furthermore, AI generates results based on a larger population rather than on subjective, personal experiences [31] and produces consistent results from identical medical data, independent of situation, emotion, or time of day [32].

Despite the potential shown by AI as a component of medical diagnosis, to date, no thorough analysis of approaches has been conducted, and a comprehensive conceptualization of algorithms that have previously been applied in the diagnostic process for various diseases remains lacking. Recent studies have examined the application of AI in healthcare in general [33,34,35] or in specific clinical domains [36]. For example, Jiang et al. [33] focused on classical AI algorithms such as support vector machines and neural networks within the entire healthcare sector. Another study by Rauschert et al. [36] dealt with AI in clinical epigenetics, in which the authors investigated individual treatment characteristics of patients based on genetic and epigenetic profiles. In contrast, our research concentrates on the diagnostic process and the extent to which AI is being integrated into it. Specifically, we aim to provide a classification of the current state of AI in disease diagnostics and to guide future research directions. Our study uncovers which algorithms are particularly suitable for the diagnosis of which diseases and where other approaches may be more advisable. By pointing out explicit performance measurements, we explain the overall suitability of existing strategies. We argue that this is of great interest to researchers and practitioners since the relevance of AI within the diagnostic process will continue to increase. There is an urgent demand for academia to provide an overview of the AI-based approaches that are being applied in order to understand the intricacy of this ever-evolving research area. To address these pressing issues, our study is guided by the following research question:

RQ: How is artificial intelligence currently applied in healthcare to facilitate the diagnostic process, and what are the pertinent future research directions?

We have followed the latest methodological guidelines for literature reviews [37, 38] and conducted a critical review (CR) analyzing existing literature on a broad topic to reveal weaknesses, inconsistencies, problems, or discrepancies [39]. We initially reviewed the status quo based on the existing literature on AI in medical diagnostics, for which we focused on research articles that dealt with practical applications. This has guided us in portraying the current AI landscape in the diagnostic process, including areas that have not yet been adequately covered. By conducting a critical assessment, we are able to provide scholars with knowledge about future research directions and improvements [38].

This paper structures our understanding of AI algorithms that have been researched as components of medical diagnostics. Researchers will find the overview of application fields helpful when considering unexplored areas where the deployment of AI for healthcare services could be beneficial and future research might be necessary. Practitioners will be better able to understand the extent to which AI improves the diagnostic process and how healthcare specialists and patients, as well as the overall treatment process, benefit from it. Society will gain an understanding of how AI is likely to be used within diagnostics, including the debates currently surrounding it. We believe this study will be valuable in driving related research and practical implementations, as well as in prompting a constructive debate on how AI applications can be further integrated into the diagnostic process to enhance overall patient outcomes.

2 Related work

The fundamental goal of the diagnosis of disease lies in determining whether or not a patient is affected by a disease [40]. Diagnosis can be seen either as a process of classification or as a “pre-existing set of categories agreed upon by the medical profession to designate a specific condition” [41]. The entire diagnostic process, in general, is sophisticated [8] and a “patient-centered, collaborative activity that involves information gathering and clinical reasoning with the goal of determining a patient’s health problem” [10]. The Committee on Diagnostic Error in Health Care developed a conceptual model to comprehend the diagnostic process (cf. Figure 1).

Fig. 1
figure 1

Conceptualization of the diagnostic process (adapted from [10])

Initially, the patient experiences a health problem and its associated symptoms, which prompts the person to contact the healthcare system. Sufficient information is then collected by reviewing the patient’s clinical history, conducting an interview, performing a physical exam and diagnostic testing, and arranging referral and consultation involving other medical experts. Information gathering, integration, and interpretation, as well as providing a working diagnosis—for example, a single or differential diagnosis—together represent a continuous process that can be repeated several times. The working diagnosis and an explanation of it are shared with the patient, and appropriate treatment is planned. Finally, this process results in an outcome for patients and the healthcare system, such as learning from errors or a timely diagnosis.

This highly complex process has, on the one hand, great potential for errors and, on the other hand, may vary according to the medical discipline. A diagnosis is usually based on the individual experiences of clinicians and may differ depending on the emotional and mental state of the clinician [42]. Medical specialists further have a broad spectrum of duties [2, 3], so the time they can devote to determining a diagnosis is usually limited. Moreover, since several experts from multiple medical disciplines are needed for a diagnosis [43, 44], the process might be extended even though timeliness is critical when considering further treatment plans [10].

The unique challenges of modern medical practice and the wide variety of diagnostic strategies can create difficulties for the entire healthcare system, and thus, more systems are relying on information technology (IT) [45]. AI is already utilized to assist medical experts and improve the diagnosis of diseases, for example, in the early diagnosis of ectopic pregnancies, helping gynecologists with their decisions about initial treatment [46]. This progress is due to improvements in technical capabilities such as processor speed and storage capacity, while costs have dropped [47]. Furthermore, the amount of medical data is continuously growing, and systems are now capable of identifying, extracting, and processing information from different sources more quickly than before [48, 49]. In addition, the technology and algorithms needed to implement such systems are now widely accessible and can be easily applied [50]; they are commonly classified into supervised, unsupervised, and deep learning [36, 51].

Supervised learning describes algorithms that learn associations based on existing samples or training data [52, 53]. The labels of a dataset are known, for example, images of fractures or ruptures that have been classified by medical specialists. This information is used to train algorithms that generate predictions for unseen data [22, 54]. Supervised learning depends on a given user input and is thus sensitive to data quality [22]. Poorly or incorrectly labeled data lead to faulty predictions, and trained algorithms might even be biased. Supervised learning is one of the most widely used AI approaches, as it provides robust classifications [36]. Common examples include logistic regression and neural networks. Supervised learning approaches have been found to aid in the diagnosis of dementia [55] and cancer [56].
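To make this concrete, the following minimal Python sketch (illustrative only, not taken from any of the reviewed studies) trains a logistic regression classifier with scikit-learn; the bundled breast cancer dataset merely stands in for expert-labeled medical data.

```python
# Minimal supervised-learning sketch with stand-in data (scikit-learn).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Labeled samples: each row is a patient, y holds expert-assigned diagnoses.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

clf = LogisticRegression(max_iter=5000)  # a common supervised baseline
clf.fit(X_train, y_train)                # learn associations from labeled data
print(f"Held-out accuracy: {clf.score(X_test, y_test):.3f}")
```

Because the model only reproduces the associations present in its labels, mislabeled training data would propagate directly into its predictions—the sensitivity to data quality noted above.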

Unsupervised learning covers self-organizing algorithms that learn associations without existing samples or training data [57, 58]. It is well suited for identifying correlations within a dataset but is unable to determine their statistical relevance [59]. Moreover, unsupervised learning might also identify irrelevant clusters if these are grounded in the data. This approach does not necessarily depend on user input but demands verification of the plausibility and salience of the identified clusters [22, 60]. Common examples include clustering and dimensionality reduction algorithms. Unsupervised learning is used, for example, for hepatitis disease diagnostics [4].
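As a minimal sketch of the unsupervised setting (with synthetic values standing in for unlabeled patient measurements), k-means clustering groups samples without any labels; as noted above, the resulting clusters still require expert verification of their plausibility.

```python
# Minimal unsupervised-learning sketch: clustering unlabeled data (scikit-learn).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # hypothetical, unlabeled patient measurements

# Scale features so no single measurement dominates the distance metric.
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)

# The algorithm always returns clusters, even in random data like this;
# whether they are clinically meaningful is for domain experts to judge.
print(np.bincount(kmeans.labels_))
```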

Finally, deep learning algorithms learn correlations among data via evolutionary tests that continually adjust predictions according to the given data [61, 62]. Deep learning might be considered an individual AI approach; however, it must be viewed from various angles due to its capability of combining supervised and unsupervised approaches [36]. It is particularly suitable for handling large datasets with multiple dimensions and input sources [61, 63], but its outcomes are not readily explainable due to its sophisticated structure and thus represent a black box for users, potentially posing a high risk for patients’ well-being [64, 65]. Common examples include recurrent neural networks and convolutional neural networks. Examples of deep learning in diagnostics are the classification of dermatological diseases [66] and atrial fibrillation detection [67].
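For illustration, a minimal convolutional network in Keras might look as follows; the input shape, layer sizes, and binary output are assumptions for a hypothetical image-classification task, not a reconstruction of any model from the reviewed studies.

```python
# Minimal deep-learning sketch: a small convolutional network (Keras).
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical grayscale medical images, 64x64 pixels, binary diagnosis.
model = keras.Sequential([
    layers.Input(shape=(64, 64, 1)),
    layers.Conv2D(16, 3, activation="relu"),   # learn local image features
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),     # disease present / absent
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(images, labels, epochs=10)  # would require real, labeled images
```

The stacked layers are precisely what makes such models powerful on high-dimensional inputs and, at the same time, opaque: no single weight maps to a human-interpretable diagnostic rule.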

The development of supervised, unsupervised, and deep learning algorithms presupposes that the available data can be split into training and test sets [33, 36]. As part of training, the algorithm is optimized by tweaking different parameters and attributes of the data. Within testing, the performance of the algorithm is validated. Both sets are usually subsets of the same dataset. Their separation does not follow any specific rules, but the training set is normally larger than the test set. An increasingly popular alternative to purposefully separating the data is the use of cross-validation, where the data is randomly split into multiple training and test sets to approximate the external validity of the algorithm [68]. Finally, the algorithm and the generated model are validated using unrelated, unseen data, for example, genuine medical data.
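The following sketch contrasts a fixed 80/20 split with 5-fold cross-validation in scikit-learn; the dataset and classifier are placeholders chosen only to keep the example self-contained and runnable.

```python
# Sketch: a fixed 80/20 split versus k-fold cross-validation (scikit-learn).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Fixed 80/20 split: one training set, one test set.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
print(f"Single-split accuracy: {SVC().fit(X_tr, y_tr).score(X_te, y_te):.3f}")

# 5-fold cross-validation: every sample serves once as test data, which
# approximates generalization better than one arbitrary split.
scores = cross_val_score(SVC(), X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```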

Various performance measurements are applied to evaluate the efficiency of a trained model. The most common measurements found in scientific publications are accuracy, sensitivity, and specificity [69, 70]. Accuracy captures the proportion of correct classifications out of all classifications on a range from 0–1 (the equivalent of 0–100%). Sensitivity indicates how many patients with a disease have been correctly identified as having that disease (the true positive rate), on a range from 0–1 (0–100%). Specificity, in contrast to sensitivity, determines how many patients without a disease have been correctly identified as not having that disease (the true negative rate), on a range from 0–1 (0–100%) [36]. Thus, higher values for accuracy, sensitivity, and specificity indicate a well-trained model that provides accurate results.
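All three measures can be read directly off a binary confusion matrix, as the following sketch with hypothetical labels shows.

```python
# Accuracy, sensitivity, and specificity from a binary confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # 1 = disease present (ground truth)
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]  # hypothetical model output

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)  # correct out of all cases
sensitivity = tp / (tp + fn)                   # true positive rate
specificity = tn / (tn + fp)                   # true negative rate
print(f"ACC={accuracy:.2f}, SEN={sensitivity:.2f}, SPE={specificity:.2f}")
# -> ACC=0.80, SEN=0.75, SPE=0.83
```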

3 Research approach

The objective of this research lies in identifying the areas of the diagnostic process in which AI has already been applied and how these approaches are to be evaluated. We performed a CR to identify relevant literature and analyze it according to the AI algorithms, their medical applications, dataset characteristics, and performance measurements. Since we were interested in providing a holistic picture, we decided to start our literature search within the information systems (IS) discipline. We argue that IS, as an interdisciplinary community addressing dynamic and evolving sociotechnical challenges, is a valid starting point for guiding medical specialists. Within the individual steps of the research procedure, we expanded our search to other scientific domains and outlets. For example, we identified publications from healthcare (e.g., Annals of Emergency Medicine) and biomedical engineering (e.g., Computers in Biology and Medicine).

As CRs are frequently performed in an unsystematic manner [39] but benefit from an informative explanation of how the literature was retrieved [37], we adopted a systematic approach in our research procedure. This ensured that we avoided a merely subjective aggregation of existing literature [39] and instead made a significant contribution by searching for the most relevant literature using keywords in scientific databases [71, 72]. We used a descriptive review focus, since this generates an interpretable pattern from the existing literature [73], leading to a depiction of the current state of our research domain [74]. The research procedure, as illustrated in Fig. 2, considers articles up until December 2020.

Fig. 2
figure 2

Research procedure

The literature search was conducted with the help of litbaskets.io, a novel IT artifact specifically designed to assist researchers in retrieving relevant literature from credible scientific sources [75]. Unlike other areas, for example medicine or psychology, the IS discipline has no uniform database covering its broad spectrum of outlets for conducting comprehensive literature searches. Litbaskets assists scholars in the concise and precise selection of relevant publications. Technically, a search string is created and used by Scopus’s advanced search, making it possible to search across indexed scientific sources [75]. Thus, litbaskets does not collect articles itself but uses the ISSN numbers of selected outlets to retrieve articles matching the search term.

We deliberately chose 154 essential IS journals because we wanted to focus on substantial and high-quality articles. The literature search considered all peer-reviewed articles; less relevant sources, such as editorials, were excluded. We carried out a full-text and metadata search; we deliberately did not limit it to metadata only, as it cannot be guaranteed that the search term is always contained in the metadata, which could result in overlooking relevant publications. We used the following query for our full-text search:

(AI or “artificial intelligence”) and (diagnostic or diagnostics)

We purposely chose a rather broad search term, pursuing a central concept of CRs: analyzing a research topic using a wider scope compared to other procedures [39], aiming to reveal “weaknesses, contradictions, controversies, or inconsistencies” [38]. We used the terms AI and artificial intelligence as well as the singular and plural spellings of diagnostic to cover the focus of our research. We did not use, for example, the names of specific machine learning algorithms, to avoid needlessly limiting our search. To concentrate on retrieving literature on AI in disease diagnostics without narrowing our initial search, we also did not include terms related to diagnostic, such as diagnosis or diagnose. This ensured a thorough analysis of existing approaches, as a broad term is crucial for providing a comprehensive overview that yields an agenda guiding future directions.

After retrieving the literature, we carefully read the title, abstract, and keywords of each publication to determine its relevance to our research question. We focused on papers that specifically considered and applied AI in the context of diagnostics. Our guiding questions were: Which AI methods were applied to improve the diagnostic process? What algorithm is used for which diseases, and how can medical specialists be assisted in the diagnosis of diseases? How have algorithms been developed for disease diagnostics? How accurate are the results of the individual applications? Publications that compared algorithms under theoretical conditions or those that did not disclose the applied algorithm were not considered in our research.

Since a basic search alone cannot provide a comprehensive review, and we were highly interested in retrieving relevant literature outside the IS discipline, we further conducted a backward search to identify additional literature. We collected all references listed in the bibliographies of the papers from the initial search, in which we only included references to other scientific publications; sources such as web pages, panel discussions, or business reports did not fit our research goals and were excluded. The relevance of the papers was determined via the same approach we used within the initial search: we read the title, abstract, and keywords, followed by categorization according to theoretical foundations. The final step included a forward search to further identify relevant publications, in which we acknowledged all papers that were retrieved in the initial and backward searches. We were interested in articles that had cited these papers after their initial publication. Once again, we read the title, abstract, and keywords and performed the categorization process as outlined above. We deliberately decided against adding keywords to filter out or exclude, for example, theory-driven research articles, to prevent overlooking relevant literature. Even though this manual process was very time consuming, it simultaneously ensured gathering the largest possible number of relevant publications.

Following the recommendation of Bem [76] for a conceptual structuring of the research topic, the categories of the examined disease and the applied learning type served as a conceptual pre-structure by which the literature was roughly categorized. Afterwards, we examined and compared the retrieved literature to identify correlations and similarities. Prior research was then grouped based on its theoretical foundations and finally reviewed concerning quantitative criteria. We initially clustered the publications according to the learning types, using supervised and deep learning, as these approaches provide more clinical results and are thus more likely to be found in healthcare [33]. A cross-check revealed only one paper that dealt with unsupervised learning. Following Jiang et al. [33], we then used the most common machine learning algorithms applied to develop AI applications [12, 13, 22] as classifications within supervised and deep learning. This resulted in 12 clusters: neural network, support vector machine, nearest neighbor, random forest, decision tree, logistic regression, naive Bayes, discriminant analysis, convolutional neural network, deep neural network, recurrent neural network, and others. We further clustered the retrieved articles according to their examined disease. Since this approach was rather granular, we assigned each disease to the affected organic system (i.e., cardiovascular, dermatological, gastrointestinal, infectious, metabolic, neurological/psychiatric, pediatric, pulmonary, and urogenital). The resulting classification according to the algorithm combined with the organic system helped us to interpret the results in a more holistic way. Finally, we analyzed the dataset characteristics and performance measurements of the algorithms. The datasets were described by their origin, sample size, number of features (e.g., patient characteristics, such as age or smoking status), training sample, and testing sample. The performance was assessed using accuracy, sensitivity, and specificity. These quality criteria aided us in determining the value of algorithms applied for disease diagnostics as well as in critically examining whether developments and applications seemed reasonable. Table 1 exemplifies how we performed the categorization by providing example articles and their classification. Please note that one paper might have dealt with multiple algorithms. Tables 2 and 3 explain the organic systems and algorithms used for clustering.

Table 1 Example publications and exemplary assignment to their categories
Table 2 Explanation of organic systems used for clustering
Table 3 Explanation of algorithms

4 Results and analysis

The execution of the CR resulted in 126 relevant articles, of which 29 were retrieved via the initial search, 17 via the backward search, and 80 via the forward search. We analyzed the articles according to their distribution by year, publication outlet, and category, as well as their dataset characteristics and performance measurements. We thereby obtained an overview of the AI-based approaches examined in the literature, including their suitability for the diagnostic process, and critically evaluated these results.

We found no prototype or implementation of a developed algorithm or model that is actually used in healthcare for the purpose of diagnosing diseases or assisting within the diagnostic process. A limited number of studies presented user interfaces as spin-off products that theoretically provide a basis for deployment in disease diagnostics. For example, Ogah et al. [78] developed a knowledge-based system including a user interface for diagnosing hepatitis B via a neural network. However, the study focused on the development of the algorithm and the underlying database rather than providing a suitable graphical interface for users. Furthermore, most studies employed textual medical data to implement suitable algorithms for disease diagnostics. Mishra et al. [66] curated a database with 4,700 images of nine common dermatological diseases, such as acne, erythema, or wheal, and used a deep learning-based approach for their classification. Another example used captured transvaginal ultrasounds of pregnant women to detect ectopic pregnancies [46]. Nevertheless, these studies are rather the exception; most approaches developed a specific algorithm to diagnose a particular disease. A limited fraction of research combines algorithms to achieve better results. For example, decision trees and case-based reasoning were integrated into an intelligent model for liver disease diagnosis, indicating considerable accuracy compared with single-method concepts [79]. Likewise, a small number of studies implemented and compared multiple algorithms and considered which one best fit the diagnosis of a disease. One study performed predictive modeling via multiple algorithms, that is, logistic regression, random forest, decision tree, and support vector machine, for the early detection of Parkinson’s disease [80]. Lu et al. [56] compared decision tree, logistic regression, nearest neighbor, neural network, and support vector machine approaches to assist in cervical cancer diagnosis. Although there are nascent approaches to comparing algorithms, no scientific publication has yet investigated whether one approach is suitable for the diagnosis of several diseases.

4.1 Distribution of articles by year

No articles were published before 1990, but articles related to AI in diagnostics have increased substantially over time (cf. Figure 3). Since 2013, the number of publications has risen considerably, which is in accord with the technological improvements of AI. The increasing number of studies is a product of the enhanced and ever-growing technical capabilities and the quantity of medical data [47, 50]. The constant growth of publications reflects not only the increased demand for AI within disease diagnostics but also the salience and legitimacy of this research area.

Fig. 3
figure 3

Total number of articles per year

4.2 Distribution of articles by outlet

From a total of 126 articles, 105 (83.3%) were published in journals, and only 19 (16.7%) were conference publications. Most articles (52, 41.3%) came from 12 outlets and were published in practice-oriented journals. This is understandable since we explicitly searched for articles that examined AI within the diagnostic process. The highest number of journal articles was published in Expert Systems with Applications (15, 11.9%). Furthermore, a large portion of publications came from outside the IS discipline; this was acceptable and to be expected, as our focus was clearly on applications in healthcare. Thus, most practical applications of AI in disease diagnostics were to be found in medical outlets. Table 4 outlines the distribution of articles by outlets with more than one publication.

Table 4 Distribution of articles by outlets with more than one publication

4.3 Distribution of articles by category

Among the organic systems, cardiovascular disorders clearly stood out as the most heavily researched area, with 34 articles (27.0%). Neurological/psychiatric disorders (20, 15.9%), cancer (18, 14.3%), gastrointestinal diseases (15, 11.9%), and infectious disorders (13, 10.3%) have all been studied with similar intensity. The fewest articles were found for pulmonary and urogenital diseases, with four articles each (3.1%). The small number of papers might indicate a lack of perspective for those research fields. In the case of the algorithms, neural networks were by far the most researched, with 71 articles (42.5%) examining this algorithm in the context of the diagnosis of diseases. The support vector machine was the second most researched, with 35 articles (21.0%). Deep learning approaches, such as the deep neural network (2, 1.2%) and the recurrent neural network (2, 1.2%), as well as a mixture of other applied algorithms that do not fall under the scope of our classification, were the least considered. Table 5 depicts the number of articles in each category.

Table 5 Number of articles in categories (one paper might have dealt with multiple algorithms)

To validate whether a research domain has received much attention and whether certain areas are highly correlated, we compared the organic systems and algorithms the papers dealt with using a research assignment matrix (cf. Table 6). We matched the corresponding organic systems with each algorithm. The illustration aided us in examining and allocating the results of the large number of relevant publications. The matrix provides an overview of the status quo—areas with less research do not indicate that further studies are inevitably required there. However, the matrix highlights that transferring previously gained knowledge of the application of AI within diagnostics to other research areas seems possible and advisable. It appears that most of the algorithms were already being used in the diagnosis of neurological/psychiatric diseases (10), followed by cardiovascular and gastrointestinal disorders (7) and urogenital diseases (7). Dermatological, infectious, metabolic, and pulmonary diseases (3) were the least examined. In terms of the algorithms, neural networks (8) and support vector machines (8) were the most researched. Both methods have been applied to the diagnosis of diseases for nearly every organic system. Deep learning approaches, that is, deep (1), recurrent (2), and convolutional neural networks (4), have not been applied much in diagnostics. They are currently limited to the diagnosis of only a few diseases.

Table 6 Research assignment matrix

4.4 Dataset characteristics and performance measures

The examined literature varied considerably in its depth of detail and its presentation of the data used as well as the results achieved. Table 7 outlines example dataset characteristics and performance measures.

Table 7 Example dataset characteristics and performance measures

ACC = Accuracy, SEN = Sensitivity, SPE = Specificity, SP = Sample, FR = Features, TR = Training, TE = Testing, ? = no information, cv = cross validation.

We distinguished the origin of the datasets as self-retrieved information (37), an existing database (51), medical data grounded in other studies (7), and no details provided (31). The results showed that most of the studies did not reveal detailed information on the data origin or on how the data were collected, including circumstances or contexts. However, an in-depth presentation and explanation of the origin of the medical data used to develop algorithms for disease diagnostics is a vital component for interpreting outcomes and verifying whether findings are adaptable and generalizable. A positive example was presented by [46], who used clinical data based on a long-term study conducted at the Department of Obstetrics and Gynecology of the University Hospital “Virgen de la Arrixaca” in the Murcia region of Spain from November 2010 to September 2015. A total of 406 cases of tubal ectopic pregnancies of women from 16 to 46 years of age visiting the emergency room or the first-trimester-pathology unit were collected. The authors elaborated on personal and medical variables, outlined which data were gathered, such as 2D transvaginal ultrasound, and reported who examined the patients. Most studies developed an algorithm based on medical data from an existing database (e.g., the Machine Learning Repository of the University of California at Irvine). This does not necessarily entail any negative consequences. However, one might argue that researchers are more familiar with self-retrieved data, leading to better results in terms of the algorithms’ efficiency. Existing data might be used with the objective of testing a developed algorithm without really aiming to diagnose a disease, instead merely serving as a convenient test bed.

The quantitative details of the datasets also showed considerable discrepancies among the studies. Other than a few articles (9), the majority outlined the exact sample size of the dataset; however, the number varies considerably, ranging from 9 to 212,554. Studies rarely used a large dataset (N > 1,000), and we found only a subset of nine publications that considered large samples. For example, [67] analyzed atrial fibrillation using a convolutional neural network based on 150,060 samples. In contrast, [89] examined 53 patients to diagnose osteoporosis. Small sample sizes hinder generalization to a larger population; thus, results must be interpreted with caution. There is no “one size fits all” answer to the requirement for a minimum sample size. However, larger samples are favorable for achieving better results [90], especially since small datasets frequently fail to capture certain patterns [91,92,93]. Furthermore, small sample sizes produce bias, which can persist even when more than 1,000 records are used [94].
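A simple simulation illustrates why small samples hinder generalization: the spread of an accuracy estimate shrinks as the number of evaluated cases grows. The assumed true accuracy below is hypothetical; the sample sizes echo the extremes reported above.

```python
# Sketch: variability of accuracy estimates at different sample sizes.
import numpy as np

rng = np.random.default_rng(0)
true_accuracy = 0.85  # hypothetical true performance of some model

for n in (53, 1_000, 150_060):  # sample sizes echoing the reviewed studies
    # Each evaluation observes n correct/incorrect outcomes (Bernoulli trials).
    estimates = rng.binomial(n, true_accuracy, size=10_000) / n
    print(f"n={n:>7}: SD of estimated accuracy = {estimates.std():.4f}")
```

With 53 patients, the reported accuracy can swing by several percentage points purely by chance, whereas with 150,060 samples the estimate is essentially stable.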

The identified training and testing samples first showed that the data were usually split according to certain patterns (e.g., 60/40, 70/30, 80/20). However, it is remarkable that only a small number of studies (26) used cross-validation to randomly split the data, approximate the external validity of the algorithm, and thus generate better outcomes [68]. Furthermore, the algorithms were typically trained and tested via a subset of the dataset and afterwards validated using the same entire dataset again. This is a common approach; nevertheless, it seems problematic, as the results are analyzed in isolation without the inclusion of separate and previously unknown medical data.

Similar to the characteristics of the datasets, the reporting of the performance measures was rather patchy. Sixty-six out of the 126 studies reported detailed information on the three measures. The remaining publications were missing at least one value, whereby accuracy was usually used to assess the efficiency of the algorithm. However, reporting accuracy as a single performance measurement might be misleading [95]. The results of the studies were mostly quite promising; the overall accuracy ranged from 61.42 to 100.00 (M = 86.85; SD = 24.82), sensitivity from 59.00 to 100.00 (M = 62.82; SD = 43.84), and specificity from 60.53 to 100.00 (M = 58.32; SD = 45.42). Shrivastava et al. [96] contributed a solid example of thorough reporting of the necessary performance measures while achieving high results for disease classification in dermatology. However, other studies failed to grant insights into their performance, hindering the interpretation of findings on a more general level.

Evaluating algorithms in disease diagnostics without detailed information about the dataset used and the efficiency achieved prevents the generalization of results and their transfer into a different context. Comprehensive information is crucial for understanding a study’s justification and its contribution to theory and practice. Too often, too much information is missing to undertake an adequate evaluation.

5 Discussion and future directions

We critically reviewed the identified publications according to both their strengths and weaknesses. Following the recommendations for CRs [38], we set out to determine directions for future research. We categorized our findings under suitable headings that deal with identical or similar issues. Table 8 summarizes exemplary future research questions according to the identified areas.

Table 8 Future research questions

5.1 Advancements and explicability

Recent studies have been primarily concerned with the development of a specific algorithm for the detection of a particular disease [87, 124]. Moreover, some algorithms have been researched more than others. Future research should examine possible advancements by combining the various existing algorithms to achieve better results compared with the isolated observation of a single algorithm [97, 98]. We strongly recommend more research in the field of deep learning for disease diagnostics, so that large amounts of medical data can be processed faster [61, 62] and satisfactory results are more likely to be reached [67, 81]. However, an essential technical restriction of the more complex but performant deep learning approaches lies in the fact that the results of AI remain a black box to humans [125]. The outcomes are not always comprehensible, which makes it, on the one hand, nearly impossible to learn from the AI’s decisions and, on the other hand, challenging to build trust in the system itself. Future research should therefore focus on improving the understandability and explainability of AI-derived conclusions [99]. A transparent prediction-making process leads to a trustworthy relationship between the AI and medical experts [100]. Overall, we offer the following research proposition (RP):

RP1: We propose more research on how AI can be implemented to achieve better diagnostic results. Novel development strategies need to ensure understandability and explainability of the AI’s result and create a transparent and trustworthy environment in which medical experts are assisted.

5.2 Corroboration and portability

AI algorithms that aim to assist in the diagnostic process have recently been developed, and they generally consider a single dataset based on textual input. Numerous studies indicate satisfying results [126, 127]. However, there is a certain risk that results are not applicable to other domains and only deliver proper results for a specific application. Future research needs to corroborate these findings in diverse patient populations [116]. This can be achieved by using heterogeneous and larger datasets (i.e., with N > 1,000 samples) [55, 89, 96, 101,102,103,104,105,106,107,108,109] with a range of formats, such as X-ray images or ultrasounds [56, 110], which are currently almost neglected. Furthermore, larger datasets should routinely be split using cross-validation to approximate external validity and thus generate better outcomes [68]. Simultaneously, findings need to be transferred to other types of diseases, for example, different types of cancer [67, 87, 111,112,113, 118], but also to other clinical applications [100, 114, 115, 117]. In addition, one must ask why scientific evidence is not yet widely integrated into disease diagnostics, for instance, in hospitals or other clinical environments. The question remains whether AI approaches persist in real-world scenarios [118, 128]. Portability is therefore a crucial factor in the future of AI in disease diagnostics [116, 119]. Thus, we offer the following proposition:

RP2: We propose more research on how AI in diagnostics can be adapted in other clinical environments and confirmed using larger datasets with enhanced validity. Additionally, we suggest the examination of AI for diagnosing disease in a real-world scenario, confirming its practical suitability.

5.3 Integration and collaboration

The diagnosis of different diseases is a strongly subjective process and a cognitively challenging task that depends on the clinician’s individual experience and differs based on the clinician’s emotional and mental state [4, 42]. With the application of AI to the diagnostic process, medical experts are assisted by AI, possibly leading to superior results. However, current research on AI in disease diagnostics has dealt almost exclusively with technical implementations rather than with how AI might be integrated into existing technical infrastructure. Recent AI developments may yield sufficient results for diagnosing diseases; however, it is still unknown how results will be presented to medical practitioners. Diagnostics still presupposes collaboration between humans and AI [129]. This requires intensified research on integration practices, especially on the development of user-friendly interfaces for multiple devices [66, 79, 83, 118, 121]. Researchers and healthcare practitioners should aim to develop AI collaboratively to reach a better outcome for patients [100]. Scientific endeavors could go even further by developing a system that assists in the entire diagnostic process instead of just focusing on diagnosing a particular disease [122]. Furthermore, we argued earlier that collaboration between humans and AI can yield superior results; however, due to a lack of practical examples, future research needs to take a closer look at collaborative aspects when humans partner with AI in the diagnostic process. It may be that virtual human–AI teams outperform humans working in isolation [123]. This brings us to our final proposition:

RP3: We propose more research on how AI can be integrated into existing technical infrastructure for assisting within the diagnostic process using suitable interfaces running on multiple devices. Moreover, scholars need to examine whether virtual teams consisting of humans and AI outperform medical teams or single expert efforts when diagnosing diseases.

6 Conclusion and limitations

In this article, we have illustrated the application of AI within diagnostics in current academic research. We presented our CR, which classified the retrieved literature according to organic systems, algorithms, dataset characteristics, and performance measurements. These results are useful for practitioners and healthcare researchers.

The main theoretical contribution of this research is the proposal of a research agenda including exemplary research questions. We thereby seek to guide researchers’ efforts and encourage future research in the field of AI as part of medical diagnostics. Furthermore, illustrating the intensity of studies, highly correlated areas, and an overview of unexplored research is helpful for the future deployment of AI for diagnosing diseases. On a practical level, practitioners can understand the extent to which AI improves the diagnostic process and how the overall healthcare system benefits from it. Medical professionals learn how AI can be applied to diagnosing diseases, which could yield suitable suggestions for further developing AI-based approaches. In addition, healthcare experts comprehend which challenges still need to be tackled before disease is diagnosed in collaboration with AI. In terms of implications for society, readers realize that AI is likely to be used in healthcare to diagnose diseases or at least to assist during the process. Moreover, the application of AI as a component of the diagnostic process provides opportunities for innovative digital health and is simultaneously able to ensure enhanced patient outcomes.

This research is not free of limitations. First, it should be noted that not all existing AI-based algorithms were used in our classification. We classified a few approaches as “other,” which may indicate that these algorithms have been applied either rarely or probably not at all for diagnostic purposes. In addition, it could indicate that these algorithms are simply not suitable for application within the process of medical diagnosis. Furthermore, we looked at publications that dealt exclusively with the technical application of algorithms. We have thereby limited our research and may have missed relevant literature dealing with AI in the diagnostic process from which we could have acquired additional findings. Moreover, with respect to future research directions, researchers might consider looking at the entire course of the diagnostic process and whether and how AI can be used in ways other than the diagnosis of disease.