What Is Machine Learning? Algorithms, Programs, and Models

A myth that was pervasive in earlier stages of the history of computing was that computers can only solve problems (or perform actions) that a human programmer had specifically instructed them how to tackle. As even the broad lay audience can appreciate circa 2023, computers equipped with Machine Learning (ML) capabilities can learn from data how to perform intelligent tasks and carry out complicated problem solving on their own [1,2,3]. Whereas ML algorithms are typically programmed by humans, once implemented, these types of software can interpret data in ways that far exceed the capabilities of their human creators, not just in terms of speed but also by making inferences that are qualitatively superior to human ones, for example by avoiding human cognitive biases and blind spots and by performing inferences that humans do not do at all or are not good at performing (e.g., pattern recognition in very high dimensional spaces, sometimes at the scale of 10^6 variables or more) [4]. In addition, whereas ML programs are currently typically presented with data prepared by human operators/analysts, it is entirely possible (and in some cases routine) for them to collect data on their own, or to instruct human operators to collect the data needed for problem solving [5,6,7].

Definition

Machine Learning (ML) is the science and technology of computing systems that learn how to solve problems by analyzing data related to the problems.

To go into slightly more detail, ML algorithms, implemented in ML programs and systems, use so-called training data from which they build problem-solving models. It is useful to understand these important concepts further, since there is confusion about them among the non-technical audience (including biomedical scientists and healthcare providers and administrators).

A computer program is a set of instructions that a computer can understand and execute toward performing a task intended by the programmer [8]. For example, a program written in the language Python may instruct a personal computer (capable of executing Python commands) how to sort a set of numbers into descending order.
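
As a minimal illustration (ours, not drawn from [8]), such a program could be as short as the following Python snippet:

    # A minimal sketch of the example above: sort a set of numbers into descending order.
    numbers = [7, 2, 9, 4, 5]                        # hypothetical input values
    numbers_descending = sorted(numbers, reverse=True)
    print(numbers_descending)                        # prints [9, 7, 5, 4, 2]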

A software computer system is a complex set of interconnected programs that perform a number of interrelated functions. For example, an Electronic Health Record (EHR) system comprises a set of programs and databases that manage patient data to support patient care, record actions for compliance, perform billing and reimbursement, etc.

A computer algorithm is a generalized (programming-language-agnostic) set of computer instructions designed to solve a class of problems. Computer algorithms are presented in a form (so-called pseudo-code [9]) that is geared towards being interpretable by humans. In contrast, computer programs are written in a computer language that is interpretable by computers. For example, the “quicksort” number-sorting algorithm is a set of instructions, written in a format meant for human interpretation, that can be translated to any general-purpose computer programming language. Before it can be executed by a computer, the quicksort algorithm must be translated by a programmer into a programming language, a process known as implementing the algorithm. As an ML example, the ID3 algorithm creates, from previously-diagnosed patient data, decision tree models that can be used for diagnosing new patients.
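
To make the distinction concrete, below is one possible implementation of quicksort in Python (a sketch of ours, for illustration only; the pseudo-code form of the algorithm is language-agnostic, whereas this code is one specific, executable implementation):

    def quicksort(numbers):
        # Base case: a list with zero or one elements is already sorted.
        if len(numbers) <= 1:
            return numbers
        pivot = numbers[0]
        # Partition the remaining elements into those no larger than the pivot
        # and those larger than the pivot...
        smaller = [x for x in numbers[1:] if x <= pivot]
        larger = [x for x in numbers[1:] if x > pivot]
        # ...and recursively sort each partition around the pivot.
        return quicksort(smaller) + [pivot] + quicksort(larger)

    print(quicksort([5, 1, 4, 2, 3]))  # prints [1, 2, 3, 4, 5]

Any other general-purpose language (C, Java, R, etc.) could host an equivalent implementation of the same algorithm.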

An AI/ML model is therefore a computable representation of some problem-solving domain which, when informed with a set of inputs describing a specific instance of the problem space, outputs a solution to that instance. Such models are created either by hand, using AI knowledge engineering methods and tools, or, in the case of ML, fully automatically from training data [10].
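
To illustrate the contrast with a purely hypothetical, made-up decision rule (not a validated clinical model): a knowledge-engineered model might be hand-coded as an explicit rule such as the one below, whereas an ML algorithm such as ID3 would induce a comparable rule automatically from training data.

    def hand_built_risk_model(age, systolic_bp):
        # Hypothetical, hand-engineered model: maps inputs describing a specific
        # patient (an instance of the problem space) to an output solution (a risk label).
        if systolic_bp > 160:
            return "high risk"
        if age > 65 and systolic_bp > 140:
            return "elevated risk"
        return "low risk"

    print(hand_built_risk_model(age=70, systolic_bp=150))  # prints "elevated risk"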

Computer algorithms [9, 11] have a number of characteristics that distinguish them from computer programs:

(a) They need not (typically, and formally) be described in a specific programming language but in pseudo-code, as previously explained.

(b) They represent a potentially infinite set of programs that can be implemented in every applicable programming language and computing environment.

(c) When properly constructed, they have well-defined properties that guarantee performance, error-free (or error-acceptable) operation, generalizability, etc. (more on this later in the book).

(d) When properly implemented (i.e., translated to a specific programming language), they guarantee that the algorithm's properties are imparted to the particular program that implements the algorithm.

The field of Design and Analysis of Algorithms studies the properties of algorithms and of associated data structures [9, 11] (i.e., ways to represent and organize data for storage, retrieval and other operations), as well as methods for designing algorithms for specific problems so that desired operating characteristics (e.g., speed, memory usage, accuracy, etc.) are achieved.

Pitfall 1.1

Very commonly in the commercial healthcare space, a computer program or system implements unspecified, undisclosed or insufficiently-analyzed algorithms; hence no one knows what the properties of the program are.

In chapter “Foundations and Properties of AI/ML Systems” but also in several other places of the present volume, we will address the fundamental issue of guaranteed properties of AI/ML systems and best practices enforcing those.

Pitfall 1.2

In healthcare and the health sciences, clinical algorithms are often confused with computer algorithms (including ML algorithms). A clinical algorithm [12, 13] describes diagnostic, risk assessment, preventative, treatment or other actions needed to care for patients with specific diseases, usually in the context of evidence-based, guideline-driven medicine. It can be written in human language or in specialized computable languages. A clinical algorithm is a human-assistive decision model; it is not an algorithm that can learn how to solve the problem from data. Note, however, that a model produced by an ML algorithm can serve as a clinical algorithm in a health care setting.

ML algorithms are therefore implemented in ML computer programs that, when presented with training data, learn and output decision models. In chapters “An Appraisal of Operating Characteristics of Major Machine Learning Methods Applicable to Healthcare and Health Sciences” and “Foundations of Causal Machine Learning” we will review major ML families of algorithms and describe the types of models they output. In several other chapters of the present volume we will discuss specific algorithms and models and their characteristics and optimal use.

Well-constructed ML models do have general applicability beyond the training data, otherwise they would be just a catalogue of past problem instances and their solutions, without the ability to be used for new problem instances. Machine Learning theory [14, 15] provides results and techniques that enable and ideally guarantee the generalization properties of ML models beyond the training data.
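
A minimal sketch of this workflow (assuming the scikit-learn Python library is available; the toy "previously diagnosed patient" data below are fabricated purely for illustration): an ML program is given training data, outputs a decision-tree model, and the model's generalization is then checked on held-out cases rather than on the training data itself.

    # Minimal sketch: a decision-tree learner builds a model from labeled training data,
    # and generalization is assessed on held-out cases.
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split

    # Hypothetical "previously diagnosed patients": [age, systolic blood pressure] and a label.
    X = [[45, 120], [63, 150], [70, 165], [39, 118], [58, 142], [72, 170], [50, 128], [66, 158]]
    y = [0, 1, 1, 0, 1, 1, 0, 1]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
    model = DecisionTreeClassifier().fit(X_train, y_train)  # training: data in, model out
    print(model.score(X_test, y_test))                      # accuracy on cases unseen during training
    print(model.predict([[68, 162]]))                       # applying the learned model to a new patient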

Artificial Intelligence (AI); Types of AI and ML Tasks; on the Pervasive Applicability of ML and AI

The language we adopted on ML algorithms as a means of solving problems has a deeper significance as it relates to the definition of Artificial Intelligence. AI, depending on the context, the era and the author, has been viewed as (a) the field of science and technology that investigates the creation of fully autonomous computer systems (i.e., “Intelligent Systems”); (b) exhibiting intelligence capabilities indistinguishable from those of humans (i.e., so-called “hard AI”); (c) providing the empirical means for putting forth and testing under controlled (computer lab) conditions theories of cognition; or (d) creating programs capable of solving hard (computational, mathematical, cognitive, decision, optimization and other inferential) problems [2]. Operationally we will adopt the following view on AI:

Definition

Artificial Intelligence (AI) is the science and technology of computing systems that can autonomously solve hard inferential problems.

Such problems historically have been associated with the prerequisite of “intelligence”.

From a perspective of organization of scientific fields and their relationships, ML is one of the fields of AI which, in turn, is a field of Computer Science (CS). At the same time, ML is a core part and arguably the most important component (along with statistics) of the nascent field of Data Science.

Definition

Data Science is the field of science and technology that studies: (a) the design and execution of data measurements and sampling/collection; (b) data representation and management, harmonization, secure storage and transmission; (c) data analysis and interpretation; and (d) deployment of results in applied problem-solving settings.

Data Science spans and connects several fields including ML and statistics, as well as parts relevant to data sampling and modeling from applied mathematics, operations research, econometrics, psychometrics, decision sciences, information science, scientometrics and bibliometrics, statistical genetics and genomics, etc. [16, 17].

Figure 1 shows the relationship among Computer Science, AI and ML. As can be seen in the figure, both AI and ML are highly diverse and well-developed fields, comprising many types of research, systems, algorithms, and applications.

Fig. 1

Relationship and contents of Computer Science, AI and ML. In red rectangles, ML subfields with particular relevance to health sciences and healthcare

An important pitfall (the importance of which will become abundantly obvious in this volume) is to consider one very narrow subfield, for example Deep Learning, as the totality or the main focus/armamentarium of all of ML and AI, or, as another example, to consider ML as the totality of AI. This has serious consequences, as we will see in this book, because it prevents users of AI and ML from having the right perspective, in which a plurality of methods can be brought to bear on solving problems by matching the right method to the problem at hand.

Pitfall 1.3

Very commonly, novice advocates of ML and AI, or vendors promoting certain products, will present the whole field as being about one narrow technology or a small set of tools, ignoring the broader spectrum of available options that can solve the problem at hand. The many available options, however, have hugely varying performance characteristics that need careful consideration, as no single class of methods is suitable for all biomedical problems.

In the present book we place a heavy emphasis on data-driven forms versus expert-knowledge-driven AI, for the following reasons: first, modern health AI is predominantly data driven and will continue to be so in the foreseeable future. Second, ML is vastly more scalable than expert knowledge-driven construction of AI systems. Third, ML has many pitfalls and intricacies that require addressing. Fourth, ML is also an important component of other forms of AI (e.g., NLP, computer vision, robotics). Finally, the highlighted pitfalls and best practices are often useful for both ML and other forms of AI.

Readers not already deeply familiar with AI/ML applications in the health sciences and care delivery are likely to be surprised by the extraordinarily wide range of applications of these fields. This pervasive applicability of ML and AI is not accidental, however. It can be immediately grasped once one considers that both the health sciences and health care are fundamentally oriented toward the discovery and application of predictive and causal knowledge. Predictive modeling encompasses diagnosis, prognosis, forecasting and general pattern recognition [1,2,3,4]. Causal modeling [18,19,20] seeks to discover cause-effect relationships, to quantify their effects, and to choose among various interventions those that will maximize some desired outcome. It encompasses discovery of the laws of biology and therapeutics, and understanding of the factors that drive system- and patient-level outcomes such as development, treatment and prevention of disease at the individual level. At the level of the system of care, causal models encompass interventions on factors that affect quality of care, costs, reimbursements, patient experience and all other desiderata of health systems [21].

Neither General AI/ML, Nor Biomedical AI/ML Are New. Highlights of Achievements of Biomedical AI/ML

The general public became aware of AI and ML as viable technologies in very recent years as a result of the emergence of commercial offerings backed by established corporations as well as numerous startups catering to healthcare systems and health research organizations. The scope of adoption and widespread use of AI and ML is currently breathtaking and includes: autonomous vehicle navigation (cars, airplanes, industrial robots), cybersecurity, fraud and spam detection, financial applications, internet and e-commerce applications, manufacturing, games, education, legal, and numerous other applications [22].

In healthcare and the health sciences, examples of successful applications include automated diagnosis, prognosis, treatment selection (using as inputs: coded clinical data, text reports, images, omics data, etc.) [23]; discovery of gene mutations causing specific forms of cancer or other disease [24]; precision medicine tests (e.g., genes’ expression level patterns determining response to a treatment used for treatment selection) [25]; automated evaluation of scientific papers to determine whether the research design was good [26]; annotating genomes and other genetics applications [27]; predicting tertiary & quaternary protein structure from amino acid sequence [28]; predicting drug-drug and drug-food interactions [29]; medical imaging [30] and numerous other applications which we will cover in depth in the present volume.

The advent of big data in particular, in healthcare and population health (e.g., EHR, sensor, environmental, social networks) and the health sciences (e.g., genomics, proteomics, metabolomics, microbiomics, copy number variation, and other “bulk” and single-cell “omics” data, deep sequencing databases, research consortia data, etc.), has simultaneously demanded the development of high-quality scalable analysis methods and strongly incentivized their deployment at scale [31]. In the last 20 years there has been a synergistic co-evolution of big data generation/capture, ML-driven analysis and discovery, and key themes of modern health science and health care such as: rational drug development [32], modern post-sequencing era genomics, precision and personalized medicine [25], and learning health systems and care cost/quality/experience improvements [33], to mention just some of the key developments that depend on ML and AI and that are foci of the present work.

To give a sense of the immense scope and rapid maturation with respect to health outcomes, the following PubMed searches return:

((“outcomes” or “health services”) and “machine learning”)

→ 6255 results (most since 2015)

((“outcomes” or “health services”) and “machine learning”) and “systematic review”

→ 240 results
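
Readers who wish to reproduce such counts programmatically can do so via the NCBI E-utilities; a minimal sketch using the Biopython Entrez wrapper (an assumption on our part, and the counts returned will drift upward as PubMed grows) is:

    # Minimal sketch: count PubMed records matching a query via the NCBI E-utilities.
    # Requires Biopython; NCBI asks callers to identify themselves with an email address.
    from Bio import Entrez

    Entrez.email = "your.name@example.org"  # placeholder address
    query = '("outcomes" OR "health services") AND "machine learning"'
    handle = Entrez.esearch(db="pubmed", term=query, retmax=0)
    record = Entrez.read(handle)
    print(record["Count"])  # total number of matching PubMed records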

These systematic reviews (not cited explicitly here for space, but readily retrievable from PubMed with the stated queries) represent broad application areas with significant and diverse bodies of work. They include predictive, prognostic, diagnostic and etiologic outcomes modeling in:

Neurosurgical outcomes, depression, obesity, surgical outcomes, EEG classification, dermatology, urology outcomes, suicide prevention, Covid mortality, autoimmune disease outcomes, stroke, various cancers, dementias, orthopedic surgery, heart failure outcomes, pregnancy outcomes, imaging and radiomics analysis, sepsis in the ICU, managing covid-19, assessing physician competence, hematopoietic stem cell transplantation (HSCT), various infectious diseases, cardiac surgery, management and treatment of burns, infant pain evaluation, management of heart failure patients, bipolar disorder, degenerative cervical and lumbar spine disease, cardiovascular outcomes from wearable data, psychosocial outcomes in acquired brain injury, acute gastrointestinal bleeding, personalized dosing of heparin, Parkinson’s disease, genetic prediction of psychiatric disorders, diabetes, clinical deterioration in hospitalized patients, community-based primary health care, palliative and end-of-life care, hypertension, graft failure following kidney transplantation, outcomes in neonatal intensive care units, degenerative spine surgery, predicting fatal and serious injury crashes from driver crash and offense history data, health care spending, extraction of data from randomized trials, improving medication adherence in hypertensive patients, neighborhood-level risk factors, gait analysis, wearable inertial sensors to quantify everyday life motor activity in people with mobility impairments, outcome prediction of medical litigation, rheumatic and musculoskeletal diseases, analysis of patient online reviews, chronic low back pain, risk of readmission and several other topics.

PubMed is also informative on relative literature volumes pertaining to AI/ML methods and applications, and their trends:

Figure 2 illustrates the explosive growth of ML and AI through the number of PubMed publications over the years between 1990 and 2022. The blue line represents the number of publications for AI [MeSH] (left) and Machine Learning [Keyword] (right); the black dotted line represents the scaled number of total citations (from any field). The rate of growth in AI and ML far outpaces the overall growth rate of publications since 2015.

Fig. 2

Number of PubMed publications with MeSH term “Artificial Intelligence” (left) and keyword “Machine learning” (right) in the years between 1990 and 2022. To facilitate the comparison of growth between AI/ML and publications in general, the black dotted line represents a (downward-scaled) version of the total number of publications in PubMed

In Fig. 3, we show how the growth in health AI is distributed over some of its subfields. Machine Learning enjoys most of the growth, with Natural Language Processing (NLP) and Image Analysis following closely. Modern advances in Machine Learning, Deep Learning in particular, serve as an enabling technology for both of these subfields. Other subfields, such as Knowledge Representation, exhibited more modest growth, while Expert Systems appear to have experienced negative growth, as they are being replaced by ML. We need to remember that PubMed focuses on biomedicine.

Fig. 3

Trends of publications in various subfields of AI between 1990 and 2022

In terms of absolute volume of publications, the following table provides relevant data (Table 1):

Table 1 Health AI/ML publication volumes

These results are to some degree an artifact of the indexing of articles employed by PubMed. For example:

“clustering” (which is a form of ML)

→ 489,442 results

“Artificial neural network” (Mesh term)

→ 23,746 results

But:

“Deep learning” (Mesh term)

→ 40,377 results

Caveat: Deep Learning is a special type of artificial neural network, which entails that, if indexing were consistent, the entries indexed by “Artificial neural network” should be a strict superset of the entries indexed by “Deep learning”.

As to articles with key types of ML, in addition to the ones above we see:

“Decision tree” (Mesh)

→ 23,206 results

“Support vector machine” (Mesh)

→ 22,675

“Genetic algorithm” (Mesh)

→ 90,728 results

“Random forest” (Mesh)

→ 23,357 results

“Bayesian network” (Mesh)

→ 10,076 results

“Bayesian classifier” (Mesh)

→ 12,503 results

“Granger causality” (Mesh)

→ 3810 results

With regard to major types of AI, in addition to the ML ones mentioned, we see:

“Autonomous robot”

→ 2743 results (most since 2005)

“Expert systems”

→ 20,627 results (most since 1990)

“Knowledge representation”

→ 12,526 results (most since 1990)

“Semantic network”

→ 6482 results (most since 2005)

“Natural language processing”

→ 9659 results (most since 2005)

The exponential-rate growth of most of these methods in the biomedical literature took place for the most part in the last 15 to 30 years. It is worth noting, however, that in the field of Biomedical Informatics (aka Health Informatics) seminal publications in ML and AI appeared as early as 1959. The 1959 article by Ledley and Lusted [34] is particularly important, since it anticipated many of the key themes and methods that were rediscovered (and in some cases ignored) by modern commercial vendors and academic or industry adopters of biomedical AI/ML some 63 years later.

Similarly, the 1961 article by Warner et al. [35] is a seminal paper for the field of Medical Informatics, describing an ML-based approach to improving diagnosis in a significant disease, later expanded to many other diseases from the 1960s through the 1980s by these and other pioneering investigators.

Another important seminal early work, this time in human expert knowledge-driven AI, was the work by Miller et al. [36]. This notable AI system employed heuristic knowledge representation and reasoning and managed to perform a hard reasoning task (challenging diagnostic cases across all of internal medicine) at a level that matched or in some cases exceeded expert physicians. This system was emblematic of the efforts in the 1970s and 1980s to create AI driven by extracting and representing human expert problem solving in computable form. These efforts were followed by newer ML-based systems as more capable ML algorithms and representations took advantage of increasing amounts of training data, such as Bayesian networks and other sophisticated Bayesian classifiers [37, 38], early multi-layered artificial neural networks [39, 40], decision tree learners and other ML algorithms [1,2,3,4], which vastly outperformed early ML algorithms and human expert knowledge in ease of use, cost-effectiveness and accuracy.

The “Perfect Storm” for Biomedical AI/ML

The ability to capture massive Big Data (as indicated above) in the 2000s and onward fueled the explosive application and refinement of kernel-based nonlinear classifiers (e.g., SVMs) [1,2,3,4], boosting algorithms, causal discovery and inference algorithms [18,19,20], deep artificial neural networks [39, 40, 41], significant extensions to decision trees (Random Forests [42]), regularized versions of statistical regression algorithms [43], and other methods that could now manage tens, hundreds and in some cases millions of variables with modest compute requirements and, most importantly, with extreme tolerance to low sample sizes without overfitting [44]. These methods exhibited properties that classical statistical science and practice previously considered impossible [4, 14]. Some types of newer algorithms also had the ability to discover causality without experiments, which had also previously been considered impossible [18,19,20], and newer scalable causal algorithms made application to high dimensional data as well as scalable hybrid predictive and causal modeling feasible [45,46,47,48]. This “perfect storm” for biomedical AI/ML led to its current cycle of explosive growth. It is not surprising that the above developments in general AI and ML are closely associated with the work of 9 Turing award recipients (Marvin Minsky, John McCarthy, Herbert A. Simon, Edward Feigenbaum, Raj Reddy, Judea Pearl, Yoshua Bengio, Geoffrey Hinton, Yann LeCun) and 7 Nobel Prize recipients in economics (Herbert A. Simon, Daniel Kahneman, Clive Granger, Thomas J. Sargent, Christopher A. Sims, Joshua Angrist, Guido Imbens), thus solidifying the scientific credibility and immense importance of these methods.

Yet, despite all of this scientific activity and accomplishment (>three million entries in Google Scholar mentioning ML and >three million mentioning AI as of 2023), these fields have been presented to the general public and non-experts as either entirely new, or as recently invented in the laboratories of a handful of commercial companies. This brings us to another important pitfall:

Pitfall 1.4

The fields of general and biomedical AI and ML are not new. Ignoring the vast literature and, in some cases, re-inventing the wheel fails to take advantage of a wealth of very substantial prior work that can inform effective, safe and cost-effective use. Methods that have undergone rigorous development, analysis and validation over many years have, in general, better-understood properties, better performance robustness, and better operating safety characteristics than newer, less well-developed methods.

Best Practice 1.1

When considering development or application of AI/ML, ensure that it is informed by the well-developed and evaluated pre-existing science and technology.

Differentiation of Biomedical AI and ML from General-Purpose AI/ML

Another important pitfall we will address in this volume concerns the distinction between general-purpose AI & ML and biomedically tailored AI & ML.

Pitfall 1.5

Biomedical AI and ML have specific requirements and adaptations tailored to the goals of healthcare and of health sciences discovery. AI and ML devised and tested in unrelated fields have very different properties and do not guarantee that the goals of healthcare and health science applications will be met.

A summary of the adaptations and differentiation, to be elaborated further in this volume, is as follows:

Biomedical AI/ML:

(a) Is driven by, and has strong interactions with clinical objectives, health economics, and healthcare delivery within specific health systems.

(b) Requires the ability to handle very large dimensionalities (i.e., number of variables).

(c) Requires the ability to handle very small sample sizes without overfitting.

(d) Must be equipped with the ability to discover and model causality, since it is often necessary to estimate effects of interventions.

(e) Requires specialized data operations and the ability to handle diverse data types including clinical coded data, text, imaging, biomolecular data, and combinations.

(f) Places great emphasis on accuracy, cost-effectiveness, quality control and de-risking.

All of these requirements will be addressed in detail in the present volume.

Future Potential of Biomedical AI/ML

As widespread and rapidly growing as biomedical AI/ML is, it has the potential for orders of magnitude more growth. For example, compared to classical biostatistics, AI/ML has a smaller data science footprint in the biomedical literature, as revealed by the following PubMed searches:

“Cox regression” (Mesh)

→ 105,385 results

“Chi square test” (Mesh)

→ 116,546 results

“ANOVA” (Mesh)

→ 522,350 results

“Regression” (Mesh)

→ 1,011,918 results

However, AI/ML methods are rapidly substituting for complex inferential statistics and/or extending them in substantial ways. There are many signals of forthcoming growth in biomedical AI/ML; we mention a few strong indicators:

(a) In the domain of molecular profiling for precision medicine [25], only a handful of such profiles have been brought to market so far, although >170,000 molecular signature papers have been published (many of them showing feasibility of clinical signatures). The number of patient-touching precision tests expected to be in use at any given time in the future, if estimated as the combination of (diseases * drugs), exceeds 100,000.

(b) Other areas where massive biomedical AI/ML growth is expected include health systems outcomes improvement [21], with hundreds of thousands of AI/ML models that could conceivably be developed and deployed in the future, assuming that at least one model will be deployed for every major decision/disease/outcome combination affecting patients, units and systems.

(c) Similarly, in the space of precision clinical trials [25], currently much less than 1% of all trials are precision trials, and migrating to this model of clinical therapeutics validation will necessitate application of AI/ML at scale across the research domain (>20,000 large new trials annually).

(d) In radiology, we can safely expect a massive transition to computer-assisted (and in some cases fully automated) interpretation of clinical or research imaging, across many health science and care domains.

(e) In single-cell transcriptomics and other omics (including “multiplexed” combinations) and their spatiotemporal extensions, the use of AI/ML is absolutely necessitated by the immense dimensionalities (>5000 cells * 10,000 molecular probes with current technology yields dimensionalities of >50 million variables per patient/research subject). Single-cell omics technologies are the successor of bulk deep sequencing technologies (themselves the successor of microarray technologies) and, according to all indications, will be driving biological discovery for decades to come. If these precursors are an indication, then hundreds of thousands of applications of AI/ML to single-cell omics are to be expected [49, 50].

(f) The vast majority of models referenced in the hundreds of systematic reviews (covering thousands of modeling studies) mentioned in section “Neither General AI/ML, Nor Biomedical AI/ML Are New. Highlights of Achievements of Biomedical AI/ML” are pre-clinical or otherwise feasibility efforts, as stated in the corresponding systematic reviews. These reviews found very promising results but identified that the models have not yet reached the clinically mature stages needed for broad deployment. Closing this gap will undoubtedly be a large part of the future of health AI/ML.

Pitfalls and Related Lack of Best Practices Undermine Biomedical AI/ML. AI/ML Trust and Acceptance

The strong and sustained trends outlined above in the literature and in commercial AI/ML suggest that AI/ML will grow to be a science and technology that permanently and irrevocably enables progress across all aspects of health science research and health care delivery. There is therefore an ethical and utilitarian necessity for this science and technology to be executed with an emphasis on meeting performance, safety, and cost-effectiveness requirements.

Performance requirements entail that AI/ML has to be accurate and minimize false positive and false negative results. For example, massive application of AI/ML, if allowed to generate false positives, will drown the research system in noise, rendering the space of scientific investigation a destructively low signal-to-noise environment. Avoidable false negatives due to poorly designed AI/ML represent the corresponding opportunity cost.

Safety requirements entail that AI/ML systems applied in clinical care settings, as well as in preventative policy and other public health settings, should not allow avoidable errors such as wrong treatment/intervention decisions that incur risk to patients, populations, or systems of care. They should also not fail to identify opportunities to improve patient/human-subject health (for example, diagnosis of treatable diseases, or opportunities to improve the cost and quality of the system of care), as such failures translate to decreased life expectancy and quality of life for individuals and populations and negatively affect the health systems that care for them.

Cost-effectiveness requirements entail that AI/ML systems applied in care settings as well as in health science discovery should not be wasteful in time-to-results, compute requirements, sample size requirements, or cost of decisions. The costs of such inefficiencies can quickly become unmanageable.

Perspectives on building trust, adoption, and acceptance of technology by humans (as individuals or at the society level) are diverse and encompass performance, economic, legal, accountability, ethical, psychological, social and other factors [51,52,53,54,55,56,57,58,59]. Operationally we frame the above requirements from the perspective of stakeholders using a Biomedical AI/ML trust and acceptance framework, comprising the following 7 dimensions:

1. Scientific and Technical Trust and Acceptance. AI/ML models must be accurate at deployment (e.g., low error rate, not falling outside their boundaries of strong performance (known as their “knowledge cliff”)).

2. Health System Trust and Acceptance. AI/ML models must be safe, cost-effective and well-embedded in systems of health, with clear benefits and without unexpected/unacceptable risks, disruptions or other negative consequences.

3. System-of-Science Trust and Acceptance. AI/ML models must be safe and cost-effective to operate in the system of science, without unexpected/unacceptable risks and consequences.

4. Beneficiary Trust and Acceptance. AI/ML models must be accepted by patients and human subjects, individually and at the community level.

5. Delivery and Operator Trust and Acceptance. AI/ML models must be accepted by clinicians and scientists.

6. Regulatory Trust and Acceptance. AI/ML models must comply with applicable laws and be approved by regulatory bodies.

7. Ethical Trust and Acceptance. AI/ML models must be non-discriminatory and must promote health equity and social justice related to health science and care (e.g., by being non-discriminatory on the basis of race, socioeconomic factors, gender, etc.).

In their 2022 program solicitation (NSF 22–502), entitled “National AI Research Institutes Accelerating Research, Transforming Society, and Growing the American Workforce”, the National Science Foundation (NSF) acknowledged that identifying, prioritizing, and satisfying the fundamental attributes that render an AI trustworthy are open research challenges. Notably the program described trustworthiness through examples from other areas of mature technology such as automobiles or electric lighting. These systems are trustworthy, “because they are reliable, predictable, governed by rigorous and measurable standards, and provide the expected benefits. Facilitated by basic knowledge of their operation, we are familiar with common faults and how to address them, and there is infrastructure to deal with problems we cannot handle ourselves.” It’s a compelling proposition that health-related AI should have similar characteristics.

The whole purpose of the present volume, therefore, is to outline a set of preferred practical requirements and methods (“Best Practices”) that will move us forward to biomedical AI/ML that avoids pitfalls and achieves the 7 dimensions of trust, acceptance and eventual adoption. In order to justify the requirements and assemble/build the proposed best practices, we will also need to introduce a body of necessary technical background knowledge.

Intended Purpose and Audience of the Book

AI & ML are extremely popular topics and numerous books are available, generally falling into four categories:

1. Hands-on instructional texts on how to build a general-purpose AI system, e.g., using a particular Python software package. Such books are not specific to health care or health sciences and their specific problems; nor do they provide a strong conceptual understanding of how different models work and how this relates to their applicability to different health problems.

2. General-purpose data mining, AI, and ML textbooks. Such books do not relate to health care or health science and do not give advice on how to develop models specifically for health care or any other area: they focus on a very narrow aspect of model development. Moreover, they do not differentiate feasibility and exploratory analysis from the much more mission-critical clinical and other high-stakes modelling settings that are so prominent in healthcare and the health sciences.

3. Books on health care analytics and the promise of AI in health care. Most works in this category focus on conventional (reporting and compliance) analytics. A few address the new capabilities brought by AI/ML. They are not designed to provide the reader with a deep understanding of what the (primarily technical) challenges are in health care AI, or what the pitfalls are and how specifically and systematically to avoid them.

4. Bioinformatics and genomics books discussing AI/ML approaches in that context. These are technical books that typically do not focus on systematic methodologies for ensuring appropriateness of various AI/ML methods, or on their methodological underpinnings.

From our review of the literature, as of 2023 there are more than 100 textbooks in the above categories. We view them as very useful background for broad fundamentals and/or context of use: from such books readers can learn basic concepts of general machine learning, and can also learn how to build certain types of models. Our present effort, however, focuses on knowledge and practices specific to how health science, clinical, translational, and healthcare AI/ML systems differ from general-purpose AI/ML. The book aspires to impart comprehensive and in-depth knowledge on how to build robust and safe models for the high-stakes settings in health science and care, and to evaluate the strengths and weaknesses of such models produced by others. We will cover both general (mostly immutable) scientific principles as well as specific technical guidance that may evolve over time.

More precisely, we envisioned the present volume to be the first book in the field to provide guidance for the following concepts/topics:

1. The critical differences between general-purpose AI & ML and medically-applicable AI & ML.

2. Building models that can be applied with minimal risk in high-stakes settings including clinical applications, healthcare system optimization, and discovery of clinical modalities.

3. Models that integrate multi-level, multi-modal clinical and molecular data.

4. The importance of data design and post-modelling safeguards for high-stakes applications.

5. Common limitations and remedies of efforts (commercial and academic) in the field.

6. In-depth presentation of not just predictive but also causal and hybrid causal-predictive methods.

7. A comprehensive summary and critique of operating characteristics of all major AI & ML methods.

This volume emphasizes the need and methods for biomedical AI/ML to:

1. Be intentional, with well-defined and meaningful goals and metrics of success.

2. Effectively manage risk for errors that may adversely affect the health of patients, the effectiveness of health systems, and the effectiveness of the system of science.

3. Operate in real-life (as opposed to idealized and simplified theoretical) health care as well as health science discovery ecosystems.

4. Develop within a lifecycle that starts from problem statements and needs and extends all the way to successful deployment and continuous iterative improvement.

5. Prevent and overcome the fundamental dangers of overfitting and underfitting, as well as overconfidence in models and underperformance of models.

6. Have known properties that guarantee performance and safety.

7. Be based on sophisticated and appropriate data designs.

8. Be differentiated along the levels of systems/stacks, protocols, algorithms, and models.

We adopt an interdisciplinary perspective, using and integrating methods from Data Science, Computer science (Machine Learning, AI, predictive analytics), Statistics, Epidemiology (study design), Clinical Decision Support, Bioinformatics, Clinical and Health Informatics, Genomics, Learning Health Systems, and Precision and Personalized Medicine.

Our intended audience comprises all stakeholders in the healthcare and health science ecosystems: (a) Applied and research Health Data Scientists working in industry, academia, and healthcare. (b) Clinicians/Professionals/Practitioners who are called on to evaluate, select, and use AI&ML-based decision support. (c) Healthcare and translational (e.g., pharmaceutical and biotechnology) industry leaders/administrators, including but not limited to IT leaders, who wish to evaluate and deploy competing technologies in medical AI&ML. (d) Educators and Students in informatics, ML & AI, health economics, health business administration, and data science. (e) Funding agency officers. (f) Journals and their editors. (g) Regulatory agency officers. And (h) Community members, representatives and advocates.

We elected to make this book an open access one, ensuring that all members of our intended audience can access this volume without financial restrictions.

Outline of the Book: Style, Format, and How to Read

The book is organized in three parts (with a total of 18 chapters): Foundations, Modelling, and Implementation. Each chapter typically covers several of the following: technical didactic exposition, case studies (of success and failure varieties), related pitfalls discussion, best practices addressing the pitfalls and serving the trust principles, along with literature references and occasional discussion thereof. We also provide brief chapter abstracts (at the start of each chapter), assignments for classroom use, and recapitulation of concepts, definitions, pitfalls and Best Practices (at the end of each chapter).

Educators may wish to use the book, in whole or in part, as a classroom textbook. Features supporting classroom use include:

1. Consistent structure and tone across the chapters. The two main authors have written the majority of the material and have co-authored or edited the contributed chapters to harmonize the content and style across the volume.

2. Practice questions, discussion topics and assignments. Some of these are more conceptual and open-ended (e.g., appropriate for less technical learners) and some are more technology-oriented (e.g., targeting learners who need to develop technical knowledge and skills).

3. Comprehensive coverage of the topic, not just the methods that the authors have invented, have used, or prefer.

4. In the future we intend to provide an “official” answer key to the assignments and discussion topics of this volume.

Because our intended audience is very diverse, we make every effort to use plain language with minimal jargon and to keep mathematical, statistical and computer science technical details to a minimum. This does not mean that we shy away from presenting formulas, algorithms, and theorems. However, when we do so, we present them only when they are necessary for making sense of the Pitfalls/Best Practices under discussion. We also sought to use the simplest language possible that does not sacrifice validity, and we introduce the background we think is required to understand these technical elements, emphasizing the intuition behind them and their practical consequences.

The style and level of detail have been ground-tested through our teaching of these concepts (for a combined 30+ years) in a variety of settings and with a variety of audiences (e.g., from undergraduate college interns to professional programmers, to graduate students in data science fields, to medical residents, to health sciences faculty, and to national tutorials with mixed health care and health science audiences). As is expected, our writing reflects our own formal training in these fields (spanning 27 years combined). More importantly, both main authors of the present volume are working scientists who have led, and are active in, many R&D method/technology and application projects. These have occurred in the health sciences domain (mostly funded by the NIH and the NSF) but also in industry and in health care contexts. These experiences have provided us with a wealth of knowledge about the roadblocks that our intended audience routinely faces, and the ways to overcome them.

In the end, of course, the reader will decide whether the approach taken here is as effective as we hope it will be. We caution that audiences with strong technical backgrounds may find the text “hiding” some technical details. We advise these readers to explore the ample references for more technical depth, and to focus their reading of the book on applied aspects that are not covered at all, or are not synthesized sufficiently, in the primary technical literature.

Audiences without or with incomplete technical backgrounds may find some concepts challenging at first read. Unless otherwise noted, we advise this type of reader to not skip the scientific and technological principles underlying ML/AI, since these are critical for successful use in high stakes tasks and environments.

With regard to the book assignments, we revisit and incrementally enrich and deepen many of them as new knowledge is provided by the various chapters. Readers should address them with the knowledge gained up to the chapter in which they are encountered.

Finally, we recommend that the independent reader read the chapters in sequence (possibly only skimming material already mastered elsewhere). We made every effort to cross-reference concepts in each chapter with all other parts of the book where they are discussed, so even an out-of-sequence reading should be free of confusion.

For in-classroom use, the class instructor is trusted to determine the right components to emphasize or omit, and the right sequence for the class objectives and the learners' background and needs. The incremental structure of assignments and discussion topics is valuable for gradually developing an increasingly sophisticated understanding of recurring themes and topics. It can also serve as a record of the students' progress in mastering the related body of knowledge and their ability to integrate and evaluate the material. This will be disrupted unavoidably in any out-of-sequence reading, however, and the instructor has to make adjustments to the assignments in such cases.

We also note that all assignments are motivated by real-life examples of methods development and application challenges. They can be traced to literature and case studies in the public domain as well as to our personal experience as working scientists, teachers, advisors, consultants and administrators. Whenever we felt there was a possibility of breaching the privacy or reputation of third parties, we omitted specific references to technologies and persons; in all other cases we name methods, products, and scientists, especially when credit was due for important discoveries or other scientific and technological contributions.

Caveats and Disclosures: Sourcing Best Practices

Where Do Best Practices Come from?

The realistic answer is that, circa 2023, biomedical AI/ML Best Practices are not to be found in one place, stated as such, in a complete and immutable form. This volume, to the best of our knowledge, is the first book to strive for that goal. Our recommendations originate from a variety of sources and are characterized by different levels of (a) maturity/validation, (b) breadth of applicability, and (c) technical clarity and depth. We have thus considered and included in the present volume the following sources for the presented Best Practices:

1. Published guidelines stated as such; for example, the PubMed search (“artificial intelligence” or “machine learning”) and “best practices” (e.g., [60]) yields 217 results, several of which contain proposed best practices (of various degrees of validation and usefulness, as we will see in subsequent chapters). In some cases important Best Practices and guidelines are contained in articles with a broader scope, for example, guidance issued by the biometrics division of the NCI [61].

2. Implicit but clear findings and recommendations published by quality control consortia (e.g., [62]).

3. Broad and well-designed benchmark studies that demonstrate the appropriateness and effectiveness of various algorithms in specific settings (e.g., [62, 63]).

4. AI/ML competitions (properly designed to prevent biases), e.g., [64].

5. Criteria used in meta-analytic and systematic review studies to assess quality, risk of bias, etc. (see for example chapter “Reporting Standards, Certification/Accreditation & Reproducibility”).

6. Published reporting, regulatory, and certification standards and requirements (e.g., [65]).

7. Theoretical properties of AI/ML algorithms, protocols and related methods that directly suggest proper and improper usage (see for example chapters “Foundations and Properties of AI/ML Systems”, “An Appraisal of Operating Characteristics of Major Machine Learning Methods Applicable to Healthcare and Health Sciences”, and “Introduction to Causal Inference and Causal Structure Discovery”).

8. Case studies that inform generalizable types of errors and suggest strategies to avoid them (see for example chapter “Lessons Learnt from Historical Failures, Limitations and Success of Health AI/ML. Enduring Problems and the Role of Best Practices”).

9. Literature reports that have focused on identifying specific types of errors or modeling/analysis problems and have provided reusable approaches for avoiding or minimizing them (e.g., [66]).

In general, this volume avoids offering guidance based on the authors' preferred workflows or methods unless these fall into one of the above categories.

A key value proposition of the present work, therefore, is that we have assembled, reviewed, critically analyzed, and synthesized a plurality of sources to identify pitfalls and the best currently known ways of improving AI/ML quality, performance, effectiveness and safety.

We caution the reader that like every other cutting-edge field of scientific endeavor, this is work in progress and some of the currently known Best Practices in ML/AI will undoubtedly improve and be revised as new methods come into play and the field deepens and widens its knowledge. We welcome reader feedback and criticism and we will make every effort to appraise and incorporate all useful suggestions in future editions. See also “Final Synthesis of Recommendations” for discussion about future evolution of Best Practices.

Outline of the Book: Contents Summary by Part and Chapter

Part I: Foundations

The present chapter, entitled “Artificial Intelligence (AI) and Machine Learning (ML) for Healthcare and Health Sciences: The Need for Best Practices Enabling Trust in AI and ML”, aims to provide introductory concepts about the field, to motivate the need for best practices in biomedical AI and ML, and to map out the book's scope and contents so that readers are well oriented. A small set of high-level pitfalls and guidelines is also included.

Chapter “Foundations and Properties of AI/ML Systems” provides a broad introduction to the foundations of health AI and ML systems and includes: (1) Theoretical properties and formal vs heuristic systems; practical implications of complexity for system tractability. (2) Foundations of AI including logics and symbolic vs non-symbolic AI, Reasoning with Uncertainty, AI/ML programming languages. (3) Foundations of Machine Learning Theory.

Chapter “An Appraisal of Operating Characteristics of Major Machine Learning Methods Applicable to Healthcare and Health Sciences” provides an outline of how each method works; in addition, we summarize the intended uses, the usual way each method is employed in practice, and its known and unknown properties. Readers who have not delved into ML before will find a useful introduction and review of key methods. Readers who already know some or all of these methods will gain additional insights as we critically revisit the key concepts and add to their prior knowledge summary guidance on whether and when each technique is applicable or preferred (or not) in healthcare and health science problem solving.

Chapter “Introduction to Causal Machine Learning” covers the important dimension of causality. The vast majority of texts in biomedical AI/ML focuses on predictive modeling and does not address causal methods, their requirements and properties. Yet these are essential for determining and assisting patient-level or healthcare-level interventions toward improving outcomes of interest. Causal methods are also indispensable for discovery in the health sciences.

Chapter “Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems” outlines a comprehensive process, governing all steps from analysis and problem domain needs specification, to creation and validation of AI/ML methods that can address them. The stages are explained and grounded in many existing methods. The process discussed equates to a generalizable Best Practice guideline applicable across all of AI/ML. An equally important use of this Best Practice is as a guide for understanding and evaluating any ML/AI technology under consideration for adoption for a particular problem domain.

Part II: Modelling

Chapter “The Process and Lifecycle of a Clinical-Grade AI/ML Model” introduces the notion of “clinical-grade” models and contrasts such models with feasibility, exploratory, or pre-clinical ones. The main tenet of the chapter is that AI/ML systems and models must be designed and deployed in a manner that is aware of, and seamlessly integrated into, healthcare systems or discovery processes (for healthcare and health science discovery, respectively). The steps outlined span from requirements engineering to deployment, monitoring, and iterative development and continuous improvement. They also emphasize contextual factors that influence success.

Chapter “Data Design for Biomedical AI/ML” addresses the critical aspect of data (or research) design and related best practices. This endeavor is foundational to the success of AI/ML for both clinical care and scientific discovery. Yet, to the extent of our review, a systematic and in-depth treatment of this most important aspect receives little attention in the ML literature. In this chapter, (a) we present common designs (e.g., retrospective, cohort, case/control, EHR, time series, RCT, hybrid, etc.) and the implications of design choices for the success of modelling; and (b) we discuss common data biases (e.g., selection bias, assertion bias, confounding bias, Simpson's paradox, etc.).

Chapter “Data Preparation, Transforms, Quality, and Management” introduces guidance for performing data preparations so that the goals of modeling are effectively and efficiently accomplished. It also addresses data quality, mapping, feature engineering, data transformations, clinical and research data warehousing and management.

Chapter “Model Selection and Evaluation” addresses best practices for finding models that are accurate and generalize well. Estimation of the generalization error is also addressed, both in terms of error estimator procedures and their interaction with model selection, and in terms of error metrics and their effect on analysis. In addition to general-purpose performance metrics, this chapter also discusses aspects of model evaluation that are unique to biomedical applications, such as evaluating clinical efficacy, the suitability of a model for clinical decision support, and health economic evaluations.

Chapter “Overfitting, Underfitting and Model Overconfidence and Under-performance in Machine Learning and AI” takes a deep dive into overfitting and underfitting, which are arguably two of the most far-reaching and impactful challenges in AI/ML with high-dimensional data, modest or small sample sizes, and modern high-capacity learners. Avoiding over- and under-fitted analyses and models is critical for ensuring high generalization performance. In modern ML/AI practice these factors typically interact with error estimator procedures and model selection, as well as with sampling and reporting biases, and thus are considered together in context. These concepts are also closely related to statistical significance and scientific reproducibility. We examine several common scenarios where overconfidence in model performance and/or model underperformance occur, as well as recommended practices for preventing, testing and correcting them.

Chapter “From ‘Human vs Machine’ to ‘Human with Machine’” addresses: (a) empirical evaluations of healthcare and health science AI/ML decision-making; (b) empirical comparisons of computer vs human decision-making in the health sciences and health care; (c) important human cognitive biases that lead to decision errors; (d) a summary comparison of human vs computer strengths and limitations that may manifest as errors in medical practice or science discovery settings; and (e) practical considerations in constructing hybrid computer-human problem-solving systems.

Chapter “Lessons Learned from Historical Failures, Limitations and Successes of Health AI/ML. Enduring Problems, and the Role of Best Practices” covers a variety of case studies relevant to best practices. Examples include: the infamous “AI winters”; overfitting; using methods not built to purpose; over-estimating the value and potential of early and heuristic technology; developing AI that is disconnected from real-life needs and application contexts; over-interpreting or misinterpreting results from learning theory; failures/shortcomings of the literature, including the persistence of incorrect findings; failures/shortcomings of modeling protocols, data, and evaluation designs; high-profile science failures; and factors that may render guidelines themselves problematic. In most cases these case studies were followed by improved technology that overcame the limitations. The case studies reinforce, and demonstrate the value of, rigorous, science-driven practices for addressing enduring and new challenges.

Chapter “Characterizing and Managing the Risk of AI/ML Models in Clinical and Organizational Application” covers practical methods for reviewing the face validity of AI/ML models and for characterizing and managing the risk of such models at the development and deployment stages. This chapter also briefly discusses broader methods and practices for detecting and correcting issues with ML modeling, and the emerging concept of debugging ML models and analyses.
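As one generic example of the kind of deployment-stage check alluded to here (not the chapter’s full risk framework), the sketch below examines probability calibration, i.e., whether predicted risks match observed event rates, on synthetic data.

```python
# Calibration check: do predicted risks match observed event rates?
# A generic pre-deployment sanity check on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.calibration import calibration_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
observed, predicted = calibration_curve(y_te, probs, n_bins=10)
for p, o in zip(predicted, observed):
    # Well-calibrated models show observed rates close to predicted risks.
    print(f"predicted risk {p:.2f} -> observed rate {o:.2f}")
```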

Part III: Implementation

Chapter “Considerations for Specialized Health AI/ML Modelling and Applications: NLP” looks into field- and task-specific best practices for the domain of health NLP.

Chapter “Considerations for Specialized Health AI/ML Modelling and Applications: Imaging – Through the Perspective of Dermatology” looks into field- and task-specific best practices in the specialized domain of imaging (with a dermatology focus).

Chapter “Regulatory Aspects and Ethical, Legal, and Societal Implications (ELSI)” reviews the regulation of AI/ML models and the risk management principles underlying international regulations of clinical AI/ML, discusses the conditions under which AI/ML models in the U.S. are regulated by the Food and Drug Administration (FDA), and reviews FDA’s Good Machine Learning Practice (GMLP) principles. In its second part, the chapter provides an introduction to the nascent field of biomedical AI ethics, covering general AI ELSI studies, AI/ML racial bias, and AI/ML health equity principles. The chapter discusses (and gives illustrative examples of) the importance of causality and equivalence classes for the practical detection of racial bias in models. It concludes with a series of recommended best practices for promoting health equity and reducing health disparities via the design and use of health AI/ML.
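For illustration only, and much simpler than the causal, equivalence-class-based approach the chapter advocates, the sketch below computes per-group true-positive rates; large disparities are a common screening signal that warrants the deeper analyses described in the chapter. Variable names and values are hypothetical.

```python
# Simple screening check: compare true-positive rates across groups.
# This is a generic group-fairness heuristic, not the causal approach
# described in the chapter; variable names and values are hypothetical.
import numpy as np
import pandas as pd

def tpr_by_group(y_true, y_pred, group):
    df = pd.DataFrame({"y": y_true, "yhat": y_pred, "g": group})
    positives = df[df.y == 1]
    return positives.groupby("g").yhat.mean()   # TPR within each group

y_true = np.array([1, 1, 1, 1, 0, 0, 1, 1, 0, 1])
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 1, 0, 0])
group  = np.array(list("AAAAABBBBB"))

print(tpr_by_group(y_true, y_pred, group))  # large gaps warrant deeper (causal) analysis
```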

Chapter “Reporting Standards, Certification/Accreditation & Reproducibility” covers the interrelated topics of enhancing the quality, safety, and reproducibility of clinical AI/ML via (a) reporting standards; (b) recent efforts for accrediting health care provider organizations for AI readiness and maturity; (c) professional certification; and (d) education and related accreditation of educational programs in data science and biomedical informatics, specific to AI/ML.

Chapter “Final Synthesis of Recommendations” presents a consolidated view of the pitfalls and recommended practices identified across the book. We differentiate between macro-, meso-, and micro-level pitfalls and corresponding best practices, roughly corresponding to high-level principles, concrete differentiations of those principles, and granular/detailed tools and techniques for implementation. We discuss the non-uniqueness of best practice frameworks and several open problems. The continued development and dissemination of Best Practices for biomedical AI/ML is certain to become a field of inquiry with significant growth and value in the years to come.

Key Concepts Discussed in Chapter “Artificial Intelligence (AI) and Machine Learning (ML) for Healthcare and Health Sciences: The Need for Best Practices Enabling Trust in AI and ML”

Artificial Intelligence (AI) and Machine Learning (ML)

Data Science

Computer program

Computer system

Computer algorithm

AI/ML model

Performance requirements

Safety requirements

Cost-effectiveness requirements

Trust, acceptance, and adoption

Key Messages Discussed in Chapter “Artificial Intelligence (AI) and Machine Learning (ML) for Healthcare and Health Sciences: The Need for Best Practices Enabling Trust in AI and ML”

  1. AI/ML are long-standing disciplines with millions of published articles since the 1960s and with several Turing and Nobel awards linked to them.

  2. Biomedical AI/ML also has a long history and an extensive literature behind it, starting in the 1960s. It has recently seen explosive growth in the literature, in adoption for discovery and care, and as a field of study in its own right.

  3. AI/ML are applied broadly in science and health care because they relate to extremely broad classes of prediction/pattern-recognition, causal-modeling, and problem-solving tasks.

  4. Biomedical AI/ML has several requirements distinct from those of general-purpose AI/ML.

  5. AI/ML algorithms, programs, and systems must inspire and guarantee trust in their safety, effectiveness, and cost-effectiveness. Best Practices must be developed, shared, and followed to enable trust and acceptance.

  6. Known, well-characterized properties of AI/ML methods are essential for trust.

  7. Currently known Best Practices originate from a variety of sources, have different levels of maturity or validation, and will undoubtedly expand and improve in the future.

Pitfalls Discussed in Chapter “Artificial Intelligence (AI) and Machine Learning (ML) for Healthcare and Health Sciences: The Need for Best Practices Enabling Trust in AI and ML”

Pitfall 1.1: Unspecified, undisclosed or insufficiently-analyzed algorithms.

Pitfall 1.2: In healthcare and health sciences, clinical algorithms are often confused with computer algorithms.

Pitfall 1.3: Viewing the whole field as being about one narrow technology or a small set of tools, ignoring the broader spectrum of available options.

Pitfall 1.4: Ignoring the vast literature or “re-inventing the wheel”.

Pitfall 1.5: Ignoring the specific requirements and adaptations tailored to the goals of healthcare and of health sciences discovery.

Best Practices Discussed in Chapter “Artificial Intelligence (AI) and Machine Learning (ML) for Healthcare and Health Sciences: The Need for Best Practices Enabling Trust in AI and ML”

Best Practice 1.1: When considering development or application of AI/ML, ensure that it is informed by well-developed and evaluated existing science and technology.

Classroom Assignments and Discussion Topics for Chapter “Artificial Intelligence (AI) and Machine Learning (ML) for Healthcare and Health Sciences: The Need for Best Practices Enabling Trust in AI and ML”

  1. If science is self-correcting via reproducibility studies, what are the dangers/downsides of producing AI/ML systems/methods and related articles with a high proportion of false results?

  2. Identify articles from news sources and business publications about past industry failures in health AI/ML. Summarize them and draw your own conclusions about how to remedy and avoid such problems.

  3. What, in your view, is the ideal relationship (i.e., rules of engagement and assignment of responsibilities/foci) between industry and academia in developing and delivering health AI/ML?

  4. What are areas where health AI/ML cannot reach human problem-solving? What about the reverse?

  5. The so-called No Free Lunch Theorem (NFLT) states (in simplified language) that all ML, and more broadly all AI optimization methods, are equally accurate over all problems on average. Discuss the implications for the choice of AI/ML methods in practical use cases.

  6. “It is not the tool but the craftsman.” Does this maxim apply to health AI/ML?

  7. How would you go about identifying and measuring/documenting the impact that AI/ML has had on specific health science discoveries?

  8. Is AI confined to computer systems? Can other artificial intelligent agents, such as corporations, be viewed as AI? Discuss the implications of such a broader view.

  9. Construct a “pyramid of evidence” for health AI/ML similar to the one used in evidence-based care practice. Consider two pyramids: one focusing on clinical healthcare and another on health science discovery.

  10. You are part of a university/hospital evaluation committee for a vendor offering a patient-clinical trial matching AI product. Your institution strongly needs to improve the patient-trial matching process to increase trial success and efficiency metrics.

      The sales team makes the statement that “this is a completely innovative AI/ML product; nothing like this exists in the market and there is no similar literature; we cannot at this time provide theoretical or empirical accuracy analysis, however you are welcome to try out our product for free for a limited time and decide if it is helpful to you”. The product is fairly expensive (multi-million-dollar license fees over 5 years covering >1000 trials at steady state).

      What would be your concerns based on these statements? Would you be in a position to make an institutional buy/no-buy recommendation?

  11. A company has launched a major national marketing campaign across health provider systems for a new AI/ML healthcare product, based on its success in playing backgammon, reading and analyzing backgammon books and human games, extracting novel winning strategies from matches, answering questions about backgammon, and teaching backgammon to human players.

      How relevant is this impressive AI track record to health care? How would you go about determining relevance to health care AI/ML? How would your reasoning change if the product was based not on success in backgammon but on success in identifying oil and gas deposits? How about success in financial investments?

  12. Your university-affiliated hospital wishes to increase early diagnosis of cognitive decline across the population it serves. You are tasked with choosing among the following AI/ML technologies/tools:

      (a) AI/ML tool A guarantees optimal predictivity in the large-sample limit for distributions that are multivariate normal.

      (b) AI/ML tool B has no known properties but has been shown to be very accurate in several datasets for microarray cancer-vs-normal classification.

      (c) AI/ML tool C is a commercial offshoot of a tool that was fairly accurate in early (pre-trauma) diagnosis of PTSD.

      (d) AI/ML tool D is an application running on a ground-breaking quantum computing platform (quantum computing is an exciting frontier technology that many believe has the potential to give AI/ML hugely improved capabilities in the future).

      (e) AI/ML tool E runs on a novel massively parallel cloud computing platform capable of zettascale performance.

      What are your thoughts about these options?

  13. The same question as #12, but with the following additional information:

      (a) AI/ML tool A's sales reps are very professional, friendly, and open to offering deep discounts.

      (b) AI/ML tool B is offered by a company co-founded by a widely-respected Nobel laureate.

      (c) AI/ML tool C is offered by a vendor with which your organization has a long and successful relationship.

      (d) AI/ML tool D is part of a university initiative to develop thought leadership in quantum computing.

      (e) AI/ML tool E will provide patient-specific results in 1 picosecond or less.

      How does this additional information influence your assessment?