Introduction

Artificial intelligence (AI) is presented today as a transformative technology in the delivery of healthcare that will improve the quality of care and increase physician efficiency. Amid the promise and excitement, there is little focus on realistic expectations for the time and effort required for the successful adoption of new technology. In 1987, Nobel Prize-winning economist Robert Solow observed, “You can see the computer age everywhere but in the productivity statistics” [1]. This productivity paradox, a delay of years or even decades between the adoption of a new technology and measurable productivity gains, has been found across all economic sectors [2]. The productivity paradox following investment in computer technology has been repeatedly detected in the healthcare sector [3,4,5]. Successful adoption of any major technology requires complementary investment in process engineering, organizational change, and widespread training in new skills and techniques before productivity increases [2, 6].

Despite similarly high expectations, the widespread adoption of AI products across the broader economy has been accompanied by the same productivity paradox [6, 7]. For example, USA productivity growth from 2005 to 2019 was half of that from 1995 to 2004 [6, 7]. No amount of enthusiasm can overcome the difficulties of deploying new technology in any industry, and especially in medicine. The many unique and difficult issues involved in the delivery of healthcare will delay the productivity gains expected from the implementation of AI technology.

The purpose of this narrative review is to discuss major challenges to successfully implementing AI in clinical practice, with a focus on psychiatry. The issues relate to the maturity of AI technology, physician attitudes toward and knowledge of technology, workflow impacts, ongoing organizational support, patient safety, and problems of treating mental illness.

What AI Is and Is Not

AI is often promoted as the solution for many problems. Despite this assertion, current AI technology does not possess human general intelligence, high-level reasoning, common sense, or the superhuman intelligence of science fiction [8,9,10]. Current AI technology is made possible by the massive databases created through the continuous collection of data, including numbers, text, video, and audio, from diverse, interconnected computers and smart everyday devices embedded with computing technology. Commercial AI products do not involve high-level reasoning or thought, but typically provide services based on large datasets that may augment human intelligence and decision making [8, 11•]. For example, after human evaluation, results from a search engine may augment knowledge, or a spell checker may improve a document. In the commercial world, business models using AI tie decisions to large-scale datasets and focus on profits. The product recommender system used by Amazon is one example [12]. A profitable AI model with known errors may be acceptable to corporate decision makers, regardless of inconvenience or costs to some customers [13, 14].

Most AI, including in medicine, is based on machine learning (ML). ML blends concepts from many disciplines including computer science, statistics, and linguistics, and includes many subsets such as deep learning [11•, 15,16,17]. ML algorithms use large training datasets to determine the best model for predicting an outcome, but the model itself remains an opaque “black box” [18,19,20]. ML has had the greatest success in situations with a very large signal-to-noise ratio (few data errors), such as visual or voice pattern recognition, or games with concrete rules [10, 20]. In contrast to ML, traditional statistical models can be successfully estimated from both large and small datasets, but the model variables must be specified in advance. The focus of traditional statistical methods such as logistic regression is on understanding the relationships between the independent variables and the outcome or dependent variable. One example is using hypothesis testing to evaluate outcomes of controlled clinical trials [21]. However, vast amounts of diverse data are increasingly available in medicine, including provider data from electronic health records (EHR), imaging, and genomics; patient data from Internet, smartphone, and wearable activity; and data from non-medical sources such as government agencies [22, 23]. ML offers psychiatry new approaches for analyzing these massive datasets to predict diagnosis, guide treatment selection, and forecast illness course [24,25,26].
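To make the contrast concrete, the following minimal sketch (an illustration added here, not drawn from the cited studies; it assumes scikit-learn and uses synthetic data in place of clinical data) fits a traditional logistic regression, whose prespecified variables yield interpretable coefficients, alongside a boosted-tree ML model whose fitted form is effectively a black box.

```python
# Minimal sketch: traditional statistical model vs. black-box ML model.
# Assumptions: scikit-learn is available; the data are synthetic stand-ins,
# not real clinical data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Traditional approach: variables specified in advance, coefficients interpretable.
logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("logistic regression coefficients:", np.round(logit.coef_[0], 2))

# ML approach: an ensemble of hundreds of trees with no directly readable form.
gbm = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
print("logistic regression accuracy:", round(logit.score(X_test, y_test), 3))
print("boosted-tree accuracy:       ", round(gbm.score(X_test, y_test), 3))
```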

Maturity of AI Technology

The productivity paradox is closely tied to the maturity of AI technology. When considering the introduction of AI in a safety-critical setting such as patient care, it is important to appreciate how mature the technology currently is [27]. The technology readiness level (TRL) scale was developed by NASA in the 1970s to evaluate technical maturity and has evolved to contain 9 levels [28]. The TRL scale provides a consistent measure to monitor progress in the development of new technology, promotes testing and verification to assess maturity, and provides assurance that the technology will function as intended [28, 29]. In the TRL scale, levels 1–4 refer to basic research in the laboratory, levels 5–6 to demonstrating the technology in a representative environment, and levels 7–9 to testing, validation, and successful deployment in an operational environment [28, 29]. The TRL scale is widely used internationally by governments and diverse industries [30].
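As a simple illustration of how such a scale can be applied when auditing a portfolio of projects (a toy sketch, not part of any cited framework; the function name and band wording are assumptions paraphrasing the description above):

```python
# Toy sketch: map a technology readiness level (1-9) to the maturity bands
# described above. The band labels paraphrase the text and are not official.
def trl_band(level: int) -> str:
    if not 1 <= level <= 9:
        raise ValueError("TRL must be an integer between 1 and 9")
    if level <= 4:
        return "basic research and laboratory prototyping"
    if level <= 6:
        return "demonstration in a representative environment"
    return "testing, validation, and operational deployment"

print(trl_band(4))  # basic research and laboratory prototyping
print(trl_band(9))  # testing, validation, and operational deployment
```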

The TRL scale was recently customized for rating the maturity of ML projects in a clinical setting, with TRL levels 3–4 referring to model prototyping and development and level 5 to model validation on a population other than the training population [31]. In an evaluation of 172 studies in intensive care medicine using ML, 160 (93%) scored a TRL level of 4 or below, and none were successfully integrated into routine clinical care at level 9 [31]. In another evaluation of 494 studies in intensive care medicine using AI, 441 studies (89.3%) scored level 4 or below and 35 (7.1%) scored level 5, with none successfully integrated into routine clinical care at level 9 [32•]. Technical maturity is very difficult to achieve for a new technology. The distance between academic discovery and successful commercialization is referred to as the “valley of death” [33,34,35]. The failure to advance a technology typically occurs between TRL levels 4 and 7, a stage often viewed as too applied for academia and too risky for commercial funding.

Recent Clinical Experiences Suggest AI Technology Is Not Mature

Although there have been successes using AI in medicine, it is not widely deployed in routine clinical practice. There have also been unexpected results, errors, and failures using AI algorithms in many fields including medicine, as fundamental properties of the technology are still being learned. AI image detection algorithms are routinely described as fragile and brittle because very small changes to the input data may produce incorrect labels [27, 36], as demonstrated by a change to just 2% of the pixels in an image [37], and even by a one-pixel attack [38]. AI image detection algorithms may incorrectly learn to rely on confounding variables, such as the label “PORTABLE” from a chest X-ray machine when diagnosing pneumonia [39], a ruler present in an image when diagnosing malignant skin cancer [40, 41], the scanner model and brand and orders marked “urgent” when diagnosing a hip fracture [42], and the chest tube used for treatment when identifying a pneumothorax [43]. The inclusion of confounding variables, often non-clinical, may limit generalization and lead to incorrect findings, emphasizing the need to further understand and validate AI algorithms.
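The fragility described above can be illustrated with a deliberately simple stand-in (an added sketch, not a reproduction of the cited attacks; it assumes scikit-learn and substitutes a linear classifier on the small digits dataset for a deep image model): a targeted perturbation computed from the model's own weights is enough to flip the predicted label.

```python
# Sketch of a targeted perturbation flipping a prediction; a linear model on
# scikit-learn's digits dataset stands in for a deep image classifier.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

digits = load_digits()
mask = np.isin(digits.target, [3, 8])        # binary task: 3 vs 8
X = digits.data[mask] / 16.0                 # scale pixels to [0, 1]
y = (digits.target[mask] == 8).astype(int)
clf = LogisticRegression(max_iter=5000).fit(X, y)

x = X[0]
w, b = clf.coef_[0], clf.intercept_[0]
margin = x @ w + b                           # signed distance to the boundary (scaled by ||w||)
delta = -1.05 * margin * w / (w @ w)         # smallest L2 push just across the boundary
x_adv = x + delta

print("original prediction: ", clf.predict([x])[0])
print("perturbed prediction:", clf.predict([x_adv])[0])
print("largest single-pixel change:", round(float(np.abs(delta).max()), 3))
```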

Automated speech recognition is affected by individual accents, by historical and cultural stereotypes, and by a lack of diversity in ML training data, resulting in disparities and biases by race and against non-native English speakers [44,45,46,47]. Racial bias was found in automated measurements of speech coherence designed to identify thought disorders [48, 49]. Environmental noise, including indoor background conversations, decreases the reliability of speech recognition systems [50, 51]. A review of speech recognition for clinical documentation across specialties found that the word error rate ranged from 7.4 to 38.7%, and the percentage of documents with errors ranged from 4.8 to 71% [52]. In an analysis of notes dictated by emergency department physicians and generated by speech recognition, 71% contained errors and 15% of the errors were potentially critical [53]. Conversational clinical speech recognition is even more complex [54], with estimated word error rates for 7 commercial products ranging between 34 and 65% [55]. In a psychotherapy setting, the word error rate for psychiatrist-identified harm-related sentences using a commercial product was 34% [56]. Many challenges and biases remain that impact the safe use of speech recognition in clinical practice.
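The word error rate (WER) figures quoted above follow the standard definition: the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The short sketch below (an added illustration; the example sentences are hypothetical) computes it directly.

```python
# Sketch of the standard word error rate (WER): word-level Levenshtein distance
# between a reference transcript and a hypothesis, divided by reference length.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,            # deletion
                           dp[i][j - 1] + 1,            # insertion
                           dp[i - 1][j - 1] + cost)     # substitution
    return dp[-1][-1] / len(ref)

# Hypothetical example: one substituted word in a five-word sentence -> WER = 0.2
print(word_error_rate("patient reports no suicidal ideation",
                      "patient reports know suicidal ideation"))
```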

The most advanced use of medical AI is in radiology, with over 150 radiology products containing AI algorithms cleared for use by the FDA [57]. Additionally, many sites use locally developed rather than commercial AI algorithms [58]. However, systematic reviews of medical imaging studies found little evidence that AI-based clinical decision support (CDS) improved clinician diagnostic performance [59], and there are few randomized and prospective trials behind AI claims [60]. In a 2020 survey from the American College of Radiology with 1427 respondents, about 30% were using AI in clinical practice, but 94% of these rated the performance of AI as inconsistent [58]. The ECRI 2022 Top Ten Technology Hazards List includes AI-based image reconstruction, which can distort images reconstructed from the raw data obtained during MRI, CT, or other scans [61,62,63]. In addition to complex technical issues related to radiation dose, image capture, and reconstruction, radiology faces the same AI-related problems found throughout medicine, including unrepresentative and biased training data, workflow changes, productivity impacts, lack of external validation and validation standards, and performance deterioration over time [58, 64, 65]. A review of 62 studies to detect COVID-19 from chest X-rays and computed tomography images concluded that none were ready for clinical use due to methodological flaws and/or underlying biases [66].

Impediments to Maturity

There are impediments to the maturity of AI technology for routine use in clinical medicine that need to be recognized and directly addressed. Some of the key issues are briefly described.

Data Quality

The success of an AI algorithm is tied to the training data [67]. EHR and claims data were not developed for medical research, and there are many data quality issues related to missing data, inaccuracy, coding errors, biases, timeliness, redundancies, types of healthcare facilities, provenance or ownership trails, and lack of interoperability between vendor products [22, 68, 69]. Various factors contribute to the biases in EHR data. These include a lack of patient diversity, missing or discordant data on race and ethnicity, confounding medical interventions, oversampling of the sickest patients, fractured care across multiple providers, loss to follow-up, divergent processes within healthcare systems, measurement errors, and differences between recommended treatments in high- versus low-resource settings [70,71,72,73,74,75,76,77,78].
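Before any model training, a basic audit can surface some of these issues. The sketch below (hypothetical column names and values, assuming pandas) checks per-column missingness and record staleness in a tiny mock EHR extract; a real audit would also examine coding validity, duplicates, and provenance.

```python
# Sketch of a minimal pre-modeling data-quality audit on a mock EHR extract.
# Column names and values are hypothetical.
import pandas as pd

ehr = pd.DataFrame({
    "race":       ["White", None, "Black", None, "Asian"],
    "diagnosis":  ["F33.1", "F33.1", None, "F41.1", "F20.9"],
    "last_visit": pd.to_datetime(["2020-01-05", "2019-11-30", None,
                                  "2021-06-14", "2020-09-01"]),
})

# Proportion of missing values per column: a first look at missingness
# (e.g., missing race/ethnicity) before any ML training.
print(ehr.isna().mean().round(2))

# Timeliness check: how stale is each patient's most recent record?
print((pd.Timestamp("2022-01-01") - ehr["last_visit"]).dt.days)
```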

There are special data quality concerns for psychiatry. Behavioral health related data are often missing or inaccurate in the EHR, including diagnoses, visits, and hospitalizations [79]. For example, studies have reported discrepancies and missing diagnoses in the EHR for PTSD [80,81,82] and a lack of documentation of suicidal ideation or attempts [83]. Stigmatizing symptoms may be underreported by the patient, and symptoms and diagnoses intentionally omitted by the physician [78, 82, 84, 85]. In large studies in the USA, 27–60% of patients prescribed psychotropic medications did not have a psychiatric diagnosis [86,87,88,89]. The transition from ICD-9-CM to ICD-10-CM in the USA in 2015 was associated with some coding changes, including reports of a decrease in the diagnosis of schizophrenia from 48 to 33 per 100,000 [90], and an abrupt increase in hospital stays for opioid and alcohol abuse [91, 92]. The electronic transfer of health information after discharge from inpatient psychiatric units occurs less often than from other areas of the hospital [93]. In an international review, less than half of studies in mental health settings reported implementing an EHR [94]. Additionally, many people seek help for mental health problems in non-medical settings that are not integrated into the EHR [95]. The data quality issues and biases in the EHR contribute to the substantial challenges and risks for developing a ML algorithm for the prediction of suicide attempts and deaths [96, 97].

Public databases are an important resource for ML research. However, using a database published for one task to train algorithms for a different task (“off-label” use) can lead to biased results [98]. As an example, using reconstructed and processed MRI images from public databases to generate synthetic raw MRI data for training image reconstruction algorithms can artificially inflate algorithm performance by 25 to 48% [98]. The inflation arises from the implicit filtering and smoothing already present in the reconstructed MRI images used to recreate the raw data.

Dataset Shift

When an AI algorithm is deployed in a setting where the production data differ from the training data, the algorithm often does not perform well. This is referred to as dataset shift [99•, 100, 101]. Dataset shift may result from a wide range of differences between the training dataset and the production population, including population demographics, treatments available, standard of care, measurement technology, practice settings, disease classification, and disease prevalence. Since healthcare practices change over time, temporal dataset shifts occur, and the size of the shift varies with the clinical outcome being predicted [102, 103]. Dataset shifts also occur after changing from one EHR system to another [104].

For example, gender imbalance in training datasets led to decreased performance for diagnosis of thoracic diseases from X-ray images for the underrepresented gender [105]. The diagnostic performance of a ML algorithm to detect tuberculosis that was developed using a chest X-ray training dataset of one population fell when used with another population [106]. Population diversity in age, sex, and brain scanning site substantially affected the predictive accuracy of ML neuroimaging studies, including for autism spectrum disorder [107]. An algorithm to predict clinical orders by hospital admission diagnosis performed better when trained on a small amount of recent data (one month) than when trained on larger amounts of older data (12 months of 3-year-old data) due to changing practice patterns [108]. The performance of an ML algorithm to predict the risk of sepsis in ICU patients decreased over time, related to the shift from ICD-9 to ICD-10, and to hospital expansion that reshaped the population served [109]. AI algorithms to diagnose lesions in dermatology that were trained predominantly using white populations may underperform in patients with skin of color [110].

Dataset shift can also occur when a sensor in a measurement device, such as a wearable or smartphone, is different from the sensor used to create the training data [111, 112]. The sensors embedded in wearables and smartphones differ between manufacturers and between makes and models from the same manufacturer, resulting in measurement inaccuracies and inconsistencies [113,114,115]. Patient behaviors such as placement of the wearable may also contribute to measurement inconsistencies and dataset shifts [111, 112].
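One practical consequence is the need to monitor deployed models for shift. The sketch below (synthetic data; assumes NumPy and SciPy) compares the distribution of a single input feature, patient age, between the training population and a hypothetical deployment site and flags a statistically detectable shift.

```python
# Sketch of dataset-shift monitoring: compare a feature's distribution in the
# training data against the data seen after deployment. Synthetic data only.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
age_training   = rng.normal(45, 10, 5000)   # population used to build the model
age_deployment = rng.normal(70, 10, 5000)   # older population at the new site

stat, p_value = ks_2samp(age_training, age_deployment)
print(f"KS statistic = {stat:.2f}, p = {p_value:.1e}")
if p_value < 0.01:
    print("Warning: input distribution differs from training data (possible dataset shift)")
```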

Before AI can be safely integrated into clinical practice, the many difficult issues related to data quality, EHR, dataset shift, and public databases must be addressed.

Physician Attitudes About AI

The success of AI technology in clinical settings depends on the physicians who use it.

Physician attitudes toward the use of AI in clinical medicine are generally positive, although there are concerns about ethical and legal issues, and perspectives often vary by specialty [116]. In an international survey of 791 psychiatrists, only 17% thought a computer could replace a human in providing empathetic care, while 75% thought a computer could replace a human in documentation tasks [117]. The overall acceptance of several new technologies by 515 psychiatrists in France was moderate, with 79.6% describing them as risky [118]. In a survey of 303 physicians of all specialties in Germany, the overall attitude toward AI in medicine was positive, but only 20.5% thought AI would help with the diagnosis of psychiatric disease [119]. The dominant perspective held by 720 general practitioners in the UK was that AI would have a limited impact on primary care, with the major benefits coming from reduced administrative burdens [120]. In a survey of 121 dermatologists in the USA about AI screening tools, 49 (42%) were worried about human deskilling [121]. In a survey of 100 physicians of all specialties in the USA, although over 70% thought chatbots could assist patients with administrative tasks such as scheduling appointments and locating clinics, over 70% also thought that chatbots cannot effectively address all patient needs or display emotion and could pose a risk to patients through incorrect self-diagnosis [122].

Several studies noted that most physicians lack education in AI. Although 71% of 632 radiologists, dermatologists, and ophthalmologists in Australia and New Zealand felt AI would improve their field, 80.9% had not used AI in clinical practice and 86.2% thought there was a need for improved education and guidelines to prepare for the introduction of AI [123]. In a survey of 699 physicians and medical students in South Korea, while 83.4% thought AI would be useful in medicine, only 6% said they had good familiarity with AI [124]. A survey of 210 postgraduate trainee physicians in the UK rated the current level of AI training as insufficient [125]. In an international survey of 209 psychiatrists, only 23.9% had any formal training in technology [126]. Physicians also have varied levels of formal training and knowledge of statistics [127, 128]. These survey responses highlight the importance of quality education in AI for clinical medicine. Physicians will need to understand how to critically assess the capabilities, benefits, limitations, and risks of AI in clinical practice, and AI training must be integrated across the wide range of medical education [67, 129, 130]. Education must also emphasize that the physician remains the primary decision maker and that human intelligence and skills remain essential in patient care [68, 131].

Safety Challenges

There are many reported safety challenges that need to be understood and addressed before AI can be routinely used in a clinical setting. The safety issues with AI are especially troubling, given the disconnect between the exuberant claims and the current maturity of AI technology.

Automation Bias and Deskilling

The interaction of humans with an automated decision support tool often leads to automation bias. Automation bias occurs when a user attributes more authority to an automated tool than to other sources of advice [132, 133]. This can result in the user following incorrect advice despite contradictory evidence or prior training, and in the user failing to act without explicit prompting. There are examples of automation bias across medicine, including in the interpretation of electrocardiograms [134, 135], e-prescribing [136], whole slide image classification in pathology [137], and diagnosis of skin cancers [138]. Although the least experienced physicians may be most susceptible to automation bias [134], a major concern is that incorrect decision support misleads clinicians of all experience levels [134, 135, 138]. For example, incorrect AI advice has reduced the accuracy of expert physicians in the diagnosis of skin cancers [138] and in the histopathologic classification of liver cancer [137]. To realize the potential of AI products to improve decision making in clinical practice, the vulnerability of even experienced physicians to faulty AI must be understood and addressed.

A possible long-term consequence of the overreliance on technology is deskilling of the physician workforce, due to a loss of individual skills and a reduction in skill development [139,140,141,142]. This is of particular concern given the frequently promoted perspective that AI is inherently exceptional, will outperform other technologies, and will outperform physicians [143, 144]. Another risk of overreliance on technology is that even when a failure is detected, some users do not want to proceed without the AI system [145]. Implementation plans for AI in medicine should include long-term efforts to reduce deskilling of physicians and other medical personnel.

Black-box Opacity

The black-box nature of AI algorithms poses a major obstacle to routine use in clinical medicine. Beyond the many technical issues, the opacity of AI algorithms is often due to intentional secrecy by private commercial organizations [146, 147]. Modern AI techniques were originally developed for low-risk decisions such as online advertising and search engines [148]. In sharp contrast, when patient safety is at risk in medicine, physicians need to understand why an AI algorithm made a prediction [149]. A lack of interpretability will undermine trust in an AI algorithm, and explanations must be presented to physicians in a clearly understandable manner. There are many ongoing efforts to provide interpretability of AI algorithms, with research often focused on healthcare. Although there are various approaches to providing interpretability, each method currently has important technical challenges and limitations [149,150,151,152,153,154]. Another problem is that explanations may contribute to inappropriate trust in the capability of an AI algorithm [155]. Due to automation bias, even incorrect explanations may increase trust in AI algorithms [156]. In a study presenting patient vignettes to 220 clinicians including 195 psychiatrists, ML recommendations did not improve selection of an antidepressant drug, and incorrect ML recommendations paired with easily understood explanations decreased selection accuracy [157•]. Additionally, physicians may not understand the limitations of the explanatory methods with respect to individual treatment decisions [150]. System design for the safe use of AI in medicine must focus on improving human–computer interactions [158].
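As one example of the interpretability approaches mentioned above, permutation importance asks how much a model's performance degrades when each input is shuffled. The sketch below (an added illustration with synthetic data, assuming scikit-learn) shows the idea; note that it explains the model's behavior, not the underlying clinical relationships, and it shares the limitations discussed above.

```python
# Sketch of a post-hoc interpretability method (permutation importance) on a
# black-box model trained on synthetic stand-in data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=8, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test, n_repeats=10,
                                random_state=0)

# Features whose shuffling most degrades test performance are the model's main
# drivers; this explains the model, not the underlying biology.
for i in np.argsort(result.importances_mean)[::-1][:3]:
    print(f"feature {i}: importance {result.importances_mean[i]:.3f}")
```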

Unanticipated Safety Challenges

Complex automated systems typically fail due to the unanticipated and unintended consequences of their design, even if the nominal purpose is achieved [159]. Adding any new technology will change the workflow, often profoundly, including the creation of new failure paths [68]. In medicine, a failing AI system can create entirely new and unexpected types of safety hazards that physicians have not seen previously [160•]. Some failure modes of AI systems may be less obvious and harder to detect than those of conventional systems [161]. The study of AI system failures should also include identification of the worst possible mistake [162]. Any AI system that automatically initiates actions must provide explicit human alerting and override abilities [160•]. Significant human oversight of AI systems is especially important in safety-critical situations [27, 163]. When unexpected automation errors occur, a human must solve the problem [68, 164]. The more complex the automated system, the more essential the role of humans as exception handlers, and the greater the negative impact of deskilling. The safety of any software system, including AI, must be thoroughly evaluated in the specific environment, workflow, and context in which it is used [165].
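One design pattern that keeps the human as the exception handler is to route every automatically generated action through an explicit confirmation and logging step. The sketch below is schematic (the class, function, and threshold are hypothetical and not from any cited system) and stands in for what would be an interactive clinical workflow.

```python
# Schematic sketch of human oversight: the AI may only suggest an action;
# nothing is initiated without explicit clinician confirmation, and every
# override is logged for later review. Names and thresholds are hypothetical.
import logging
from dataclasses import dataclass

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ai_oversight")

@dataclass
class Suggestion:
    action: str
    confidence: float

def act_with_oversight(suggestion: Suggestion, clinician_approves) -> bool:
    """Return True only if the clinician explicitly approves the AI suggestion."""
    log.info("AI suggests %r (confidence %.2f)", suggestion.action, suggestion.confidence)
    approved = clinician_approves(suggestion)
    if not approved:
        log.warning("Clinician override: %r was not executed", suggestion.action)
    return approved

# The lambda stands in for an interactive confirmation prompt to the clinician.
act_with_oversight(Suggestion("flag chart for urgent psychiatric review", 0.55),
                   clinician_approves=lambda s: s.confidence >= 0.90)
```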

Validation Issues

When using an AI product, the physician relies on the validation and regulatory approval process to confirm that the product works as promised and to understand its limitations and risks. There are many challenges to validating AI algorithms, including data quality and dataset shift issues, brittleness and fragility of algorithms, black-box opacity, human factors, overall system context and complexity, and software errors [166,167,168]. The acceptable level of accuracy for an AI algorithm must be determined given its intended use. An inappropriate choice of internal validation method can artificially inflate estimates of ML algorithm performance [169]. Notably, it is more difficult to validate AI algorithms than traditional statistical models since the results may change over time as the algorithm learns [167]. The reproducibility of ML in healthcare compares poorly to ML used in other fields, as only 23% of over 200 studies between 2017 and 2019 used multiple datasets to establish their results [170]. Additionally, multiple testing approaches are required even for high-performing algorithms, since unexpected and potentially harmful errors may appear when using different methodologies [162, 171].
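The effect of an inappropriate internal validation choice can be demonstrated directly. In the sketch below (synthetic, pure-noise data; assumes scikit-learn), selecting features on the full dataset before cross-validation leaks label information and inflates apparent accuracy, while performing the selection inside each training fold does not.

```python
# Sketch of how a leaky internal validation scheme inflates performance.
# The data are pure noise, so honest accuracy should be near 0.5 (chance).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))        # 2000 noise features, no real signal
y = rng.integers(0, 2, size=100)

# Leaky: features screened on the full dataset (labels included) before CV.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5)
print("leaky CV accuracy:  ", round(leaky.mean(), 2))

# Correct: feature selection is refit inside each training fold.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
honest = cross_val_score(pipe, X, y, cv=5)
print("honest CV accuracy: ", round(honest.mean(), 2))
```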

Although the number of approved AI-based medical devices has increased in the USA and EU in recent years [172], the current state of USA regulation of medical AI demonstrates that many problems remain [173]. Traditional medical device regulation was not designed for adaptive AI/ML technologies, and continual learning poses many regulatory challenges [174, 175]. The guidelines for regulatory approval of medical AI devices are not finalized in either the USA or the EU, and new regulatory frameworks are being proposed [174, 176,177,178]. Post-market surveillance of approved medical AI devices is also needed [178, 179]. The validation requirements for AI algorithms that fall outside regulatory frameworks, such as hospital-developed AI, also need clarification [180].

Validation Examples from Radiology

The regulatory problems are readily apparent in recent studies of approved imaging products in the USA and Europe, which rely on publicly available summary information and highlight various validation problems. Of 130 medical devices with AI algorithms approved between 2015 and 2020, 126 were evaluated by the FDA using only retrospective studies, and 59 (45%) did not report the sample size [181]. Of the 130 approved devices, only 37 (28%) reported evaluation at more than one site [181]. In another study of 118 AI algorithms across imaging modalities approved by the FDA between 2008 and 2021, only 66 reported the sample size, and 45 of these 66 (68%) had a sample size of fewer than 500 patients [182]. Most FDA summary documents available to the public do not provide the demographics or details of the sample studied [182]. Of 100 European Conformity-marked AI radiology products, of which 51 were also cleared by the FDA, only 36 had peer-reviewed evidence of efficacy [183•]. Of 237 studies obtained for the 100 products, 192 were retrospective, and only 71 of the 237 (30%) included multicenter data [183•]. Citing considerable heterogeneity in deployment and pricing, the recent market entry of most products, and unclear clinical impacts, the authors concluded that “artificial intelligence in radiology is still in its infancy” [183•]. Validation and reporting standards must be improved to increase safety, physician product knowledge, and trust before AI is routinely used in clinical medicine.

Discussion

The promise of AI in medicine is real, but the current technical maturity of AI is low, and AI is not routinely used in clinical practice. In the USA, knowledge of AI-related skills is not a standard requirement for employment in the healthcare field [184]. Between 2015 and 2018, only 1 in 1250 online job postings for skilled jobs in hospitals required AI-related skills, a lower proportion than in other skilled industries [184]. It is important to have realistic expectations given the wide-ranging problems confronting the successful adoption of AI in routine clinical medicine. The many complex technical, validation, regulatory, implementation, maintenance, and monitoring issues need to be solved carefully and rigorously. There needs to be a strong emphasis on the human–computer interface, on understanding how the introduction of AI products will modify the workflow in specific clinical settings, and on training for unexpected safety hazards. Physicians need education in AI fundamentals and should be involved in the entire process of AI software development, implementation, training, and ongoing monitoring throughout the life of the system. Additionally, psychiatrists should be involved in understanding the behavioral issues related to automation bias and deskilling in medicine, as well as in the development of AI technology that predicts human behavior [185].

Many of the challenges physicians face in augmenting decision making with AI are unique to medicine. Physicians must understand and trust AI sufficiently to incorporate its advice into the treatment of individual patients. The physician must interpret an AI prediction within the overall clinical context, including patient-specific characteristics, comorbid medical conditions, unique medication regimens, and socioeconomic issues. Today, the physician hears promises about the accuracy of AI tools that may not be validated and sees explanations of AI tools that may not be clear. For example, the performance measures used to describe an ML algorithm may hide the uncertainty in its predictions [186]. This is especially relevant in psychiatry given the frequent use of categorical, probabilistic diagnoses in training data [186]. Another concern is that AI output may be plausible but incorrect and potentially dangerous for an individual patient [161]. Patients frequently have comorbid illnesses, yet the separate predictive algorithms being developed for each comorbid condition could provide conflicting advice [187]. The impacts of AI on the humanistic aspects of medicine, including the doctor-patient relationship, patient trust, and communication, need to be understood [188, 189]. AI technologies will become an important source of medical knowledge for physicians, but human inductive reasoning, situational awareness, and creative problem solving will remain fundamental for individual patient care, as exemplified by psychiatry [68, 163]. The successful deployment of AI in clinical medicine will require coordinated and ongoing efforts of physicians working with professionals with a wide range of skills from many disciplines.

Limitations

There are many limitations to this review. Technical details about the problems noted in AI software development, validation, and interpretability were not discussed. The risks of cyberattacks on AI systems, including the vulnerability of ML to adversarial attacks [190, 191], and the difficulty in detecting and tracing software errors were not mentioned [165]. Approaches to selecting appropriate tasks for AI in medicine, integrating AI tools into the EHR and other connected hospital systems, improving data quality, and conducting ongoing safety monitoring were not discussed. The wide range of collaborative skills needed for successful AI development and implementation was not covered [158]. The many ethical [192, 193] and legal issues related to AI, including fairness and inclusion, privacy, physician liability, algorithm failure, and vendor contract terms, were not discussed [194, 195]. Patient perspectives on the use of AI in clinical medicine [196, 197] and the economic impacts of implementing AI in healthcare were not mentioned [198]. Finally, detailed recommendations to address the many noted problems are outside the scope of this review.

Conclusion

AI for clinical medicine has enormous potential but lacks technical maturity. The safe and effective implementation of AI technology to augment medical decision making poses wide-ranging challenges involving technical and human factors, regulatory issues, and safety risks. These challenges must be recognized and methodically addressed to maximize the benefits from AI technology in psychiatry in the future. It is important to set reasonable expectations. The solutions are complex and will take time to discover, develop, validate, and implement, but will lead to the safe and beneficial use of AI to augment medical decision making in psychiatry.