Introduction

Machine learning (ML) technology promises to automate, speed up or improve medical processes. A large number of institutions and companies are ambitiously working on fulfilling this promise, spanning tasks such as medical image classification [1], segmentation [2] or reconstruction [3], protein structure prediction [4] and electrocardiography interpretation [5], among others. However, the deployment of machine learning for health (ML4H) tools into real-world applications has been slow because existing approval processes [6] may not account for the particular failure modes and risks that accompany ML technology [7,8,9,10,11]. Certain changes to image data that would not change the decision of a human expert can completely alter the output of an image classification [12] or regression [13, 14] model. Model performance estimates are often not valid for the shifting input distributions that can occur during real-world deployment [15,16,17]. The decision heuristics a model learns can differ from the heuristics we would expect a human to use [1, 18,19,20], and model predictions may come with ill-calibrated statements of confidence [21,22,23] or no estimate of uncertainty at all [24]. Developers proposing new ML4H technologies sometimes promise to match or even surpass the performance of existing methods [25], yet the reality is often more complicated. Classical ML performance evaluation does not automatically translate to clinical utility, as examples from large diabetic retinopathy projects [26] or COVID-19 diagnosis illustrate [27]. The reliable and integrated management of these risks remains an open scientific and practical hurdle.

In order to overcome this hurdle, we envision a framework of algorithm auditing and quality control that provides a path towards the effective and reliable application of ML systems in healthcare. In this editorial we give a brief summary of ongoing work towards that vision from our open collective of collaborators. Many of the considerations presented here originate from a consensus-finding effort by the International Telecommunication Union (ITU) and the World Health Organization (WHO), which started in 2018 as the Focus Group on Artificial Intelligence for Health (FG-AI4H) [28].

We are convinced that success on this path heavily depends on practical feedback. Auditing processes that are developed on paper have to be put to the test to ensure that they translate to utility in actual auditing practice [29]. That is why we are introducing the special issue Machine Learning for Health: Algorithm Auditing & Quality Control in this journal (see the Call for Participation for more details). The special issue will provide a platform for the submission, discussion and publication of audit methods and reports. The resulting compendium is intended to be a useful resource for users, developers, vendors and auditors of ML4H systems to manage and mitigate their particular risks.

ML4H Algorithm Auditing & Quality Control

From a bird’s eye view, many ML tools share a set of core components comprising data, an ML model and its outputs, as visualized in Fig. 1A. The typical ML product life cycle goes through stages of planning, development, validation and, potentially, deployment under appropriate monitoring (see Fig. 1B). Feedback loops between stages, for example from product validation back to development, are commonplace.

An audit entails a detailed assessment of an ML4H tool at one or more of the ML life cycle steps. It can be carried out to anticipate, monitor, or retrospectively review operations of the tool [30, 31]. The audit output should consist of a comprehensive standardized report that can be used by different stakeholders to efficiently communicate the tool’s strengths and limitations [29]. We envision a process by which an independent body, for example appointed by a government, carries out the audit using the methods and tools outlined below. These methods and tools can also be used by manufacturers and researchers themselves to carry out internal quality control [32]. In either scenario, the assessment is carried out with respect to a dynamic set of technical, clinical and regulatory considerations (see Fig. 1C) that depend on the concrete ML technology and the intended use of the tool. Audit teams should therefore cover expertise in all these dimensions and be able to synthesize related requirements across disciplines. In the following, we list a selection of considerations for all three of these auditing dimensions, tools that can be used to aid the auditing process, as well as the role so-called trial audits can play in advancing ML4H quality control.

Fig. 1

Process overview. A: Most ML tools share a set of core components comprising data, an ML model and its outputs. B: The typical ML life cycle goes through stages of planning, development, validation and, potentially, deployment under appropriate monitoring. C: An ML4H audit is carried out with respect to a dynamic set of technical, clinical and regulatory considerations that depend on the concrete ML technology and the intended use of the tool.

Auditing Dimensions

The technical validation of an ML4H tool comprises the application of data and ML model quality assessment methods to detect possible failure modes in the model’s behavior. These include model-oriented metrics, such as predictive performance, robustness [33, 34], interpretability [1, 35], disparity [36] or uncertainty [13, 24, 37], but also data-oriented metrics related to sample size determination [38], sparseness [39], bias [40], distribution mismatch [41, 42] and label quality [7]. Neglecting rigorous statistical analysis of these metrics is a common pitfall in both research and industry, which is why such analysis plays an important role during technical validation [43]. FG-AI4H has formulated a standardized quality assessment framework based on existing good practices [44,45,46] and provides practical guidance and examples for performing technical validation audits on three ML4H tools [29].
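To make this concrete, the following is a minimal sketch of two such model-oriented checks: a percentile-bootstrap confidence interval for AUROC and a simple expected calibration error. It assumes binary labels and predicted probabilities from a held-out evaluation set; the function names and default parameters are illustrative and not part of the FG-AI4H framework.

```python
# Illustrative sketch (not the FG-AI4H framework): uncertainty around a
# performance estimate and a simple calibration check for a binary classifier.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(seed=0)

def bootstrap_auroc_ci(y_true, y_prob, n_boot=1000, alpha=0.05):
    """Percentile-bootstrap confidence interval for AUROC on a held-out set."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample cases with replacement
        if len(np.unique(y_true[idx])) < 2:              # skip resamples with only one class
            continue
        scores.append(roc_auc_score(y_true[idx], y_prob[idx]))
    lower, upper = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return roc_auc_score(y_true, y_prob), (lower, upper)

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Average gap between predicted confidence and observed event frequency."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (y_prob >= lo) & (y_prob < hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(y_prob[in_bin].mean() - y_true[in_bin].mean())
    return ece
```

Reporting an interval rather than a point estimate, and a calibration figure alongside discrimination, already addresses several of the pitfalls listed above.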

Clinical Evaluation comprises an “ongoing procedure to collect, appraise and analyse clinical data pertaining to a medical device and to analyse whether there is sufficient clinical evidence to confirm compliance with relevant essential requirements for safety and performance when using the device according to the manufacturer’s instructions for use” [47]. The EQUATOR network (including STARD-AI [48], CONSORT-AI [49] and SPIRIT-AI [50]) and various scientific journals and associations [51,52,53,54] have developed guidelines for the design, implementation, reporting and evaluation of AI interventions across study designs. Key concerns are whether the ML4H tool delivers utility in clinical pathways, how cost-effective the clinician-tool interaction is [55] and whether it provides the desired benefits for the intended users [56]. To demonstrate reliable performance, it is important to look beyond common machine learning performance statistics such as accuracy and to additionally evaluate whether the ML4H tool is suited to the clinical setting in which it will be used; for example, whether the training and test data represent patient populations that are similar to the intended-use population [7, 57] and whether the output translates to medically meaningful parameters [58].
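As one example of such a check, the sketch below compares a development cohort against a sample from the intended-use population on two illustrative covariates (age and sex) using standard two-sample tests. The data frames and column names are hypothetical, and a real clinical evaluation would consider far more clinically relevant variables than a single significance test can capture.

```python
# Hypothetical sketch: flagging obvious mismatches between development data and
# the intended-use cohort on two example covariates.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp, chi2_contingency

def compare_cohorts(dev: pd.DataFrame, target: pd.DataFrame) -> dict:
    """Two-sample tests on illustrative covariates; small p-values warrant scrutiny."""
    results = {}
    # Continuous covariate: two-sample Kolmogorov-Smirnov test on age.
    results["age_ks_pvalue"] = ks_2samp(dev["age"], target["age"]).pvalue
    # Categorical covariate: chi-squared test on the sex distribution.
    sex = np.concatenate([dev["sex"].to_numpy(), target["sex"].to_numpy()])
    cohort = np.array(["dev"] * len(dev) + ["target"] * len(target))
    contingency = pd.crosstab(sex, cohort)
    results["sex_chi2_pvalue"] = chi2_contingency(contingency)[1]  # p-value
    return results
```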

Regulatory Assessment comprises the systematic evaluation of ML4H tools with respect to the applicable regulatory requirements found in laws (MDR [59], IVDR [60], 21 CFR [61], among others), in international standards (such as IEC 62304 [62], IEC 62366-1 [63] and ISO 14971 [64]), in guidelines by regulatory bodies (for example FDA [65], IMDRF [66]) and in guidelines and drafts by other organizations (for example AAMI [67] or the European Commission [68]). Such guidance is of practical concern for stakeholders in the ML4H ecosystem, including manufacturers (e.g. product managers, developers and data scientists, quality and regulatory affairs managers) and regulatory bodies (authorities, notified bodies). The FG-AI4H has identified and critically reviewed general yet fundamental regulatory considerations related to ML4H. These considerations have been converted into specific and verifiable requirements and subsequently published as a comprehensive assessment checklist entitled “Good practices for health applications of machine learning: Considerations for manufacturers and regulators” [45], which covers the entire life cycle outlined in Fig. 1B in greater detail. It includes checklist items which should be given high priority when time is limited, an important practical constraint for real-world audits. Examples and comments give further guidance to users. New regulatory developments, such as predetermined change control plans [69], imply faster software update cycles and potentially more frequent audits. Hence, good tooling can become an important means to make audits both effective and efficient.
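As a hypothetical illustration of such tooling, a checklist could be represented in machine-readable form so that items for a given life cycle stage, or only the high-priority items, can be selected automatically when audit time is limited. The item identifiers, texts, tags and priority labels below are invented for illustration and do not reproduce the actual FG-AI4H checklist [45].

```python
# Hypothetical machine-readable checklist with priority flags, so that
# time-constrained audits can focus on high-priority items first.
CHECKLIST = [
    {"id": "REQ-001", "text": "Intended use and target population are specified.",
     "priority": "high", "life_cycle": "planning"},
    {"id": "REQ-042", "text": "Training data provenance and labeling process are documented.",
     "priority": "high", "life_cycle": "development"},
    {"id": "REQ-105", "text": "Post-market monitoring plan covers model updates.",
     "priority": "normal", "life_cycle": "deployment"},
]

def select_items(stage: str, high_priority_only: bool = False) -> list:
    """Return checklist items for one life cycle stage, optionally only high-priority ones."""
    return [item for item in CHECKLIST
            if item["life_cycle"] == stage
            and (not high_priority_only or item["priority"] == "high")]

# Example: a time-boxed audit of the development stage.
for item in select_items("development", high_priority_only=True):
    print(f"[ ] {item['id']}: {item['text']}")
```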

Auditing Tools

The auditing process can be supported by appropriate tools to make it more targeted and time-efficient. These include process and requirements descriptions, as mentioned above [44, 45, 56], which help to manage dynamic workflows that may vary by use case and ML technology. They also include reporting templates that present the audit results in a standardized way for communication between different stakeholders [29, 70]. In addition, the nature of ML4H tools, as primarily software that interacts with data, lends itself to the application of test automation and simulations for the purpose of auditing. This requires software tools which can handle custom evaluation scripts, the flexible processing of different ML4H model formats and data modalities, as well as security protocols that protect intellectual property and sensitive patient information [71]. We are working with open source frameworks such as EvalAI [72] and MLflow [73] to develop solutions for automated auditing, federated auditing in remote teams and automated report creation. Our first demo platform is available via http://health.aiaudit.org/ and hosted on ITU-provisioned infrastructure. While quantitative performance measures can already be provided, it is essential to also offer qualitative measures. This is realized by requiring users to fill out a standardized questionnaire [74]. Quantitative and qualitative performance results are then provided to users as a comprehensive and standardized report card [70].
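As an illustration of what such automation could look like, the sketch below logs the quantitative results of one audit run together with the qualitative questionnaire answers using MLflow, so that both can feed into a standardized report card. The model interface, data and questionnaire contents are placeholders, and this is not the implementation behind the demo platform.

```python
# Minimal sketch (not the platform implementation): one automated audit run that
# evaluates a scikit-learn-style model on a held-out audit set and records the
# results with MLflow. `model`, `X_eval`, `y_eval` and `questionnaire` are
# placeholders supplied by the surrounding audit workflow.
import mlflow
from sklearn.metrics import roc_auc_score

def run_audit(model, X_eval, y_eval, questionnaire: dict, tool_name: str) -> None:
    mlflow.set_experiment("ml4h-audit")
    with mlflow.start_run(run_name=tool_name):
        y_prob = model.predict_proba(X_eval)[:, 1]            # probability of the positive class
        mlflow.log_param("evaluation_samples", len(y_eval))
        mlflow.log_metric("auroc", roc_auc_score(y_eval, y_prob))
        mlflow.log_dict(questionnaire, "questionnaire.json")  # qualitative answers stored as artifact
```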

Trial Audits

We are convinced that success on the path towards a framework for algorithm auditing and quality control depends heavily on practical feedback. The development and refinement of auditing processes should routinely be accompanied by trial audits. In trial audits, draft processes and standards are applied to ML4H tools. The purpose of such an exercise is to ensure that auditing processes developed on paper translate to utility in actual auditing practice [29]. In order to facilitate the implementation of trial audits, we are introducing the special issue Machine Learning for Health: Algorithm Auditing & Quality Control in this journal. We welcome contributions pertaining to methods, tools, reports or open challenges in ML4H auditing.

Outlook

The materials summarized above bear testimony to the initial progress that has been made towards the creation of frameworks for ML4H algorithm auditing and quality control. Nevertheless, new challenges emerge as we collectively pull at the complex fabric of ML4H systems.

From the perspective of technical validation, the identification of factors that bias or degrade algorithmic performance is often constrained by the absence of relevant metadata. For example, the measurement device types (and related acquisition parameters) used to produce the validation inputs should be available in order to assess whether model performance is robust to device-type changes. This problem can be alleviated by identifying and routinely recording such information during data acquisition.
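Once such metadata is recorded, one simple way to use it is to stratify a performance metric by device type, as in the sketch below; the data frame and column names are hypothetical.

```python
# Hypothetical sketch: per-device performance from an evaluation results table
# with columns "device_type", "y_true" and "y_prob".
import pandas as pd
from sklearn.metrics import roc_auc_score

def auroc_by_device(results: pd.DataFrame) -> pd.Series:
    """AUROC per acquisition device; large gaps flag potential robustness problems."""
    def _safe_auroc(group: pd.DataFrame) -> float:
        if group["y_true"].nunique() < 2:   # AUROC is undefined with a single class
            return float("nan")
        return roc_auc_score(group["y_true"], group["y_prob"])
    return results.groupby("device_type").apply(_safe_auroc)
```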

For clinical evaluation, future considerations include extending and refining the specific requirements for how the clinical effectiveness of a tool should be monitored after deployment [59]. This also requires agreement on clear and clinically useful procedures for obtaining ground-truth annotations. It might be necessary to adapt the ML algorithm to the target population if demographics or clinical characteristics differ from those of the training setting, or if medical guidelines for diagnostics or treatment have changed [75]. For these insights to be effective, it is therefore imperative that auditors have a solid understanding of the training data, the ML algorithm, the independent test data and the evaluation metrics specific to the intended use.
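One possible, deliberately simplified, shape of such post-deployment monitoring is a rolling window over adjudicated cases that raises an alert when discriminative performance drops below a pre-specified threshold. The window size and threshold below are illustrative choices, not recommended values, and the sketch ignores the delay and cost of obtaining ground truth in practice.

```python
# Hypothetical sketch: rolling post-deployment monitoring of a binary classifier
# over cases with confirmed ground-truth labels.
from collections import deque
from sklearn.metrics import roc_auc_score

class PerformanceMonitor:
    def __init__(self, window_size: int = 500, alert_auroc: float = 0.80):
        self.labels = deque(maxlen=window_size)   # most recent adjudicated labels
        self.scores = deque(maxlen=window_size)   # corresponding model probabilities
        self.alert_auroc = alert_auroc

    def update(self, y_true: int, y_prob: float) -> bool:
        """Add one adjudicated case; return True if windowed AUROC falls below threshold."""
        self.labels.append(y_true)
        self.scores.append(y_prob)
        if len(set(self.labels)) < 2:             # need both classes to compute AUROC
            return False
        return roc_auc_score(list(self.labels), list(self.scores)) < self.alert_auroc
```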

A challenge for regulatory assessment is that standardization organizations, notified bodies and manufacturers need to efficiently formulate and parse the applicable regulatory requirements for each individual ML4H tool. Comprehensive assessment checklists [45, 51] can help with that task. However, more support is needed in terms of workflow management and assisting tools, given the limited time and budgets that professional auditors have at their disposal. Future regulatory checklists should allow for interactive selection of use-case-specific sub-checklists, automated audit report creation and the issuing of standard minimum test cases, as well as accompanying glossaries and education materials for auditors. We also have to ensure that protocols are in place which translate audit insights into actual improvements of the ML4H tool. Managing the risks presented by the exciting advances of AI in healthcare is a formidable undertaking, but with collaborative pooling of expertise and resources we believe we can rise to the task.