
Bringing Best Practices Together

Having examined the pitfalls and codified the corresponding best practices, we can now synthesize the material of this volume into the unified framework shown in Fig. 1. Background knowledge on methods' properties, human cognitive biases, and case studies (left, grey background) informs the identification of pitfalls and the codification of corresponding best practices (middle, blue background). These address method development and model development, including components such as data design and encoding, model selection, error estimation, overfitting and underfitting avoidance, managing the risks of model use and deployment, and addressing regulatory and ELSI requirements. Taken together, and when properly followed, these practices enable the 7 dimensions of AI/ML trust, adoption and acceptance (right, green background).

Fig. 1 Synthesis of the book sources and material. A block flow diagram: background knowledge on AI/ML method properties, human cognitive biases, and informative case studies maps to pitfalls and best practices for robust method development, which in turn enable the 7 trust and acceptance factors.

The many pitfalls and best practices presented throughout the present volume can be thought of as belonging to 3 levels, evident from the context of their presentation and discussion:

  • Macro Level Guidance: corresponds to the correct specification of high-level design: mapping a problem to AI/ML modeling, broad objectives, and high-level principles.

  • Micro Level Guidance: corresponds to lower-level (but still significant) implementation details.

  • Meso Level Guidance: corresponds to conceptual and implementation elements in between the high (macro) and low (micro) levels: details that are neither too broad (and abstract) nor too narrow (and minute).

Moreover, some pitfalls and best practices are less likely to change than others as the field of AI/ML progresses, and it is useful to appreciate these differences in maturity across the pitfalls and guidelines presented in this book and in the literature.

Maturity level:

  • A mature designation denotes an immutable, perennial, or otherwise robust recommendation.

  • An evolving designation denotes a recommendation that is a work in progress and likely subject to future modification.

Finally, not all guidance has the same gravity. Some best practices are critical and should always be adhered to; we grade them by impact level:

  • High Impact: corresponds to an action that must be addressed; otherwise serious pitfalls may follow with significant probability. If it is not addressed because of special circumstances in a particular problem context, then an explicit rationale must be developed to support the exception.

  • Medium Impact: corresponds to recommendations that ideally should be addressed. If, however, resource constraints or other factors preclude addressing them, then milder pitfalls may ensue, or serious pitfalls with very small probability. This category also includes recommendations that can be addressed at a later stage of developing the AI/ML solution.

In Appendix 3 we have collected all best practices of the present volume and characterized them according to maturity and impact. Indicative examples are given in the table below.

Best Practice | Maturity | Impact
10.1 [Context: model development and validation]: Deploy procedures that prevent, diagnose, and remedy errors of overconfidence in, or overfitting of, models | Mature | High
10.2.6 [Context: benchmarking of methods]: Follow theoretically and empirically proven specifications of reference (i.e., prototypical or official use of the employed method) | Mature | Medium
10.5.7 [Context: models for individual patients]: Use dense time series data, leverage population models, and search for and model abrupt distribution shifts of the individual (including learning and modeling shifts at the population level) | Evolving | Medium
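To make the structure of such a characterized checklist concrete, the following is a minimal, hypothetical sketch (the BestPractice class, field names, and entries are our own illustration, not an artifact of the book or Appendix 3) of how these entries could be represented programmatically:

```python
from dataclasses import dataclass
from enum import Enum


class Maturity(Enum):
    MATURE = "mature"      # immutable, perennial, or otherwise robust
    EVOLVING = "evolving"  # work in progress, subject to future revision


class Impact(Enum):
    HIGH = "high"      # must be addressed, or an explicit exception rationale documented
    MEDIUM = "medium"  # ideally addressed; may be deferred under constraints


@dataclass(frozen=True)
class BestPractice:
    identifier: str    # e.g., "10.1"
    context: str       # context of use
    recommendation: str
    maturity: Maturity
    impact: Impact


# Example entries mirroring the table above
checklist = [
    BestPractice("10.1", "model development and validation",
                 "Deploy procedures that prevent, diagnose, and remedy "
                 "overconfidence in, or overfitting of, models",
                 Maturity.MATURE, Impact.HIGH),
    BestPractice("10.5.7", "models for individual patients",
                 "Use dense time series data and leverage population models",
                 Maturity.EVOLVING, Impact.MEDIUM),
]

# High-impact items must be addressed or justified with an explicit rationale
must_address = [bp for bp in checklist if bp.impact is Impact.HIGH]
```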

Notice that this synthesis is enabled by examining the recommendations across multiple stages and aspects of the AI/ML R&D process; the reader can thus appreciate this final consolidated view with the benefit of the diverse knowledge of the scientific and technical aspects of health AI/ML accumulated across the book's chapters.

Open Problems, Unknowns & Future Directions

We next discuss some of the open problems and future directions in the study of best practices for biomedical AI/ML.

Adapting best practices to novel or unanticipated contexts of use. "Rules are for the obedience of fools and the guidance of wise men." This famous quote captures the notion that as user expertise grows, the need for strict adherence to rules diminishes.

For example, as we saw throughout the present volume, the context of use may influence the validity or scope of application of a guideline. Every best practice recommendation is stated with a set of use cases and contexts in mind. It is important to consider whether a particular R&D or AI/ML technology deployment project has special characteristics that may subtly or overtly affect the appropriateness of a particular guideline. See, for example, the discussion of genomics and overfitting in chapter "Lessons Learned from Historical Failures, Limitations and Successes of AI/ML In Healthcare and the Health Sciences. Enduring Problems, and the Role of Best Practices". It is up to the project team to establish that deviations from best practices, especially from ones designated as critical in the above checklists, are appropriate. Conversely, if a recommended best practice is assessed to be too lax for the problem at hand, it may be necessary to incorporate additional restrictions, safeguards, performance requirements, etc. that go beyond the ones presented here.

Related to the above, newer technology may (and should) render today's manual safeguards automatic, and indeed this is a trend in ML: many newer algorithms incorporate multiple protections against overfitting or sampling variation (see chapter "An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare & Health Science"). The same holds for statistical techniques that incorporate regularization/shrinkage and therefore do not require pre-processing the data with PCA or other dimensionality reduction. Systems designed around protocols and stacks that implement best practices will likewise improve ease of use and reduce the need for manual enforcement of best practices.
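As a small illustration of such built-in protection (a minimal sketch using scikit-learn; the simulated data and parameter choices are our own assumptions, not from the text), a penalized linear model can be fit directly to data with many more features than samples, a regime where unpenalized least squares would overfit badly and traditionally motivate PCA pre-processing:

```python
# Minimal sketch: an L2-penalized (ridge) model fit directly on
# high-dimensional data, with the regularization strength chosen by
# internal cross-validation -- a safeguard built into the estimator.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 80, 500                      # far more features than samples
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 2.0                      # only 5 features carry signal
y = X @ beta + rng.normal(size=n)

model = RidgeCV(alphas=np.logspace(-2, 3, 20))
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"mean CV R^2: {scores.mean():.2f}")
```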

Sufficiency vs. necessity and assumption-mitigating factors. Techniques designed to work well under specific sufficient assumptions often surprise the research community by working well in settings that violate those assumptions. Perhaps the most classic example in the history of data science is the success of linear models even when deployed in domains with non-linear data-generating functions (for bias-variance reasons, see chapter "Foundations and Properties of AI/ML Systems"). Another example is the practical mitigation of the hardness of large-scale causal modeling, despite worst-case computational complexity results showing that even small networks are intractable (see chapter "Foundations and Properties of AI/ML Systems"). Benchmarks have shown that methods whose sufficient assumptions are violated can, in some situations, outperform better-designed methods (for example, in very small samples simple univariate association strategies may outperform more complex modeling strategies; or, in another example, XOR parents of a response have non-zero first-order effects if they are correlated, i.e., they do not exhibit the worst-case behavior of the textbook XOR function). These considerations suggest that data scientists should operate with an open mind and employ a plurality of techniques, since these may yield unexpectedly good results. They must walk the fine line between allowing for empirical happy surprises on one hand, and avoiding wishful thinking, or failing to prioritize methods according to their fit to the problem, on the other.
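The XOR observation can be checked with a small simulation (our own illustrative construction, with arbitrarily chosen probabilities): when the two parents are correlated, each shows a non-zero first-order (marginal) effect on the XOR response, unlike the textbook worst case of independent, balanced parents:

```python
# Simulation: correlated XOR parents exhibit non-zero marginal effects.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

x1 = rng.integers(0, 2, size=n)
# x2 depends on x1 asymmetrically, inducing correlation between parents
p_x2 = np.where(x1 == 1, 0.8, 0.5)
x2 = (rng.random(n) < p_x2).astype(int)

y = x1 ^ x2  # XOR response

for name, x in [("x1", x1), ("x2", x2)]:
    effect = y[x == 1].mean() - y[x == 0].mean()
    print(f"first-order effect of {name}: {effect:+.3f}")  # non-zero

# With independent parents (p_x2 constant at 0.5), both effects are ~0.
```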

Ramifications of tampering with validated codes. It takes lengthy, costly, and demanding effort to build methods that meet specific performance criteria with guaranteed properties (as detailed in chapter "Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems"). At the same time, changes to original specifications that appear insignificant on the surface can have major effects on these properties and performance (see the text categorization benchmark study in chapter "Lessons Learned from Historical Failures, Limitations and Successes of AI/ML In Healthcare and the Health Sciences. Enduring Problems, and the Role of Best Practices"). This issue becomes highly relevant in light of open science frameworks in which anyone can access and modify codes and algorithms, with potentially uncontrollable consequences for quality. Establishing unique identifiers for algorithm and code versions, with associated properties, benchmarks, and other performance characteristics, is one possible solution to this problem.
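One conceivable mechanism (a hypothetical sketch; the function names, registry layout, and benchmark values are our own illustration, not an established standard) is to content-hash the exact source of a validated code version and bind benchmark results to that identifier, so that any modification, however superficially insignificant, becomes detectable:

```python
import hashlib
import inspect


def my_algorithm(x):
    """Stand-in for a validated ML method (illustrative)."""
    return 2 * x + 1


def code_version_id(func) -> str:
    """SHA-256 over the function's exact source text: any edit,
    however superficially insignificant, yields a new identifier."""
    return hashlib.sha256(inspect.getsource(func).encode()).hexdigest()


# Bind (illustrative) validated benchmark results to one immutable version
registry = {
    code_version_id(my_algorithm): {
        "method": "my_algorithm",
        "benchmarks": {"auc": 0.91, "dataset": "validation-set-v2"},
    }
}

# A consumer can verify that their copy is the validated version; a
# modified copy fails the lookup, signaling that the published
# performance properties no longer apply to it.
assert code_version_id(my_algorithm) in registry
```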

Transparency of algorithms and codes is desirable in many ways, especially for validating methods and tools; however, it opens up the possibility of abuse, unintended damage, or even "gaming" of models and systems. Black box methods and systems are utterly undesirable when they do not operate as intended or to the standard of safety and performance that generates trust and adoption. On the other hand, black box systems that are "locked" to prevent tampering and degrading alterations may be desirable under many circumstances. Moreover, as has been advocated in the AI literature, if the well-validated statistical advantage of a black box model is far superior to the performance of the best transparent model, it may be impractical or even unethical not to use the best-performing model. Navigating these tradeoffs is certainly a challenge.

Developing a culture that values and strives for performant and responsible AI/ML is of paramount importance for widespread adoption of BPs. Such a culture can be cultivated in key places: education (health science and professional health schools), ethics training, and engagement with the community, government, health systems, and the tech industry. It should be built around guiding principles with a sound scientific basis and broad acceptance.

Over-engineering and over-regulating. As with every aspect of science and technology, if best practices are enforced in very prescriptive, bureaucratic, or superficial ways lacking thoughtfulness, there is a danger of stifling innovation and slowing progress. The need to ensure safety and performance by adhering to best practices designed to support these goals must be carefully balanced against the very real opportunity costs inherent in unnecessarily delaying deployment of useful AI/ML in healthcare and the health sciences.

Need for evolving best practices systematically. Undoubtedly, the recommendations and codified best practices in this volume and elsewhere will evolve as the science and technology of AI/ML advance and as new use cases in health care and health science research emerge. It is important that this evolution be informed by prior generations of best practices, and that the various stakeholders called to adopt and advance the state of the art in BPs do so without reinventing the wheel. There is, in particular, a body of knowledge that we cannot imagine requiring radical redesign or abandonment at any point in the foreseeable future. For example, we will always need health care models and decision support systems with precise goals guiding their design and deployment. We will always need reliable estimators of model performance. We will always need to design data capture, sampling, and measurements in ways that support the modeling objectives. We will always need to manage the trade-offs of model bias and sampling variance. We will always need to pay particular attention to the distinction between causal and associational models and their strengths and limitations. We will always need to equip AI/ML models and systems with protective measures against operating outside their knowledge and safe performance boundaries. We will always need to guard against unethical and biased operation of such technology. These are just a few examples of essentially immutable objectives stemming from fundamental laws of statistics and learning theory, computer science, statistical risk management, computability, and ethics.
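As one example of such an essentially immutable objective in executable form (a minimal sketch; the dataset and model are generic placeholders, not from the text), nested cross-validation remains a standard way to obtain a reliable performance estimate that is not contaminated by hyperparameter tuning:

```python
# Nested cross-validation: hyperparameter selection (inner loop) is kept
# strictly separate from performance estimation (outer loop).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

inner = GridSearchCV(                      # tunes C on inner folds only
    LogisticRegression(max_iter=5000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=3,
)
outer_scores = cross_val_score(inner, X, y, cv=5, scoring="roc_auc")

# An (approximately) unbiased estimate of deployed performance,
# uncontaminated by the tuning procedure
print(f"nested-CV AUC: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```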

Bypassing regulations by claiming exploratory intent. It is all too easy to disguise a decision model that guides user actions as one that merely advises or informs the user. This has been a problem throughout the history of health AI/ML. Regulation must address such abuses, both because they can render regulation perfunctory and ineffective and because they can grossly distort the performance and safety requirements at the design stage of AI/ML systems.

Misaligned sentiment and technical reality. As we saw in chapter "Lessons Learned from Historical Failures, Limitations and Successes of AI/ML In Healthcare and the Health Sciences. Enduring Problems, and the Role of Best Practices", a large gap between the reality of AI/ML capabilities and over-promised results led to multiple deep crises in the history of the field. This danger is always present and should be taken very seriously. It is the position of the present book that rational and careful R&D and deployment, facilitated by appropriate best practices, can accelerate passage through the precarious terrain of the hype cycle.

Need to conduct R&D with limited resources. For reasons of simplicity, we presented feasibility and mission-critical development as discrete and mutually exclusive approaches. However, even when mission-critical models and systems are sought, "ideal" development, validation, and deployment of AI/ML is not only not always possible but may also be economically unwise. A phased approach, in which R&D can be abandoned once sufficient evidence against feasibility is gathered, is more economically realistic. This phased model of feasibility → iterative development until mission-critical goals are met is a worthy direction, congruent with established models of industrial R&D. At this time, however, the precise mechanisms for optimally managing such phased development in the health AI context remain an open problem.

We finally ask:

Do best practices need to be universally accepted? Should there be a single set of acceptable and/or effective best practices for health AI/ML? We can conceptualize health AI/ML BPs as comprising a shared core of necessary criteria, plus a component of additional sufficient criteria for which multiple alternatives achieve equal outcomes. In other words, to the extent that BPs abide by the laws of statistics, data science, and so forth, they may vary in the details (of the sufficient criteria). What must be present in all useful guidelines is the underlying shared core of necessary criteria. Finally, there is value in establishing BPs that further specialize the general rules to narrow fields of application, with more or less restrictive requirements. We expect such variations to become a topic of fruitful research.

Conclusion

It is our hope that the specific guidance presented here, especially where it focuses on persistent and immutable desiderata and laws of biomedical data science, will serve as a useful basis both for the growing success of biomedical AI/ML and for the study of best practices as a worthwhile subfield of AI/ML in its own right.

Key Concepts Discussed in This Chapter

Macro, Meso, and Micro Level Guidelines and Best Practices

BP Maturity levels: Mature vs Evolving

BP Impact levels: High vs Medium Impact

Key Messages Discussed in This Chapter

  • Bringing best practices together: from background knowledge on methods' properties to the identification of pitfalls and the codification of corresponding best practices, enabling the 7 dimensions of AI/ML trust, adoption and acceptance.

  • A checklist that integrates and characterizes all discussed BPs (Appendix 3).

  • Best practices may need to be adapted to novel or unanticipated contexts of use.

  • In the future, newer technology may (and should) render manual safeguards automatic.

  • Mitigating factors may exist that overcome the violation of assumptions sufficient for correctness or other properties.

  • Tampering with validated codes may have unwanted ramifications.

  • There are pros and cons of transparent algorithms and codes vs. black box technology.

  • Developing a culture that values and strives for performant, ethical and accountable AI/ML is of paramount importance for widespread adoption of BPs.

  • Over-engineering and over-regulating AI/ML are dangers that need to be recognized and addressed.

  • Improving BPs is unavoidable but needs to be done systematically.

  • Bypassing regulations by claiming exploratory intent may hinder successful regulation and AI/ML based solution design.

  • Misalignment of expectations with technical reality is a problem that may be mitigated by BPs.

  • A phased feasibility-to-iterative-development approach may allow more economically efficient R&D.

  • Multiple sets of BPs that achieve equal outcomes are conceivable.

Classroom Assignments & Discussion Topics for Chapter “Synthesis of Recommendations, Open Problems and the Study of Best Practices”

  1. Choose 3 of the BPs listed in the book that you think are most likely to change in the future, and explain why you chose them.

  2. Choose 3 of the BPs listed in the book that you think are least likely to change in the future, and explain why you chose them.

  3. Give 2 examples where an unusual context of use may override a listed guideline of your choice.

  4. Should systems that incorporate BPs give users the option to turn the BPs off? Discuss.

  5. How can generative algorithm specifications enable safe alterations of core algorithms?

  6. Provide 3 examples of algorithms that embed the following BPs: managing model complexity, managing sampling variance, differentiating between causal and predictive modeling.

  7. Have you encountered positive and negative examples of attitudes toward AI/ML safety? Discuss.

  8. Give an example of over-engineering AI/ML.

  9. Give an example of over-regulating AI/ML.

  10. Can you think of ways to ensure that a stated exploratory intent is genuine?

  11. What, in your view, generates misalignment of societal sentiment about biomedical AI/ML with technical reality? Can you think of solutions?

  12. Assume that two sets of BPs exist and give different guidance for a particular context or use case. How would you resolve this situation?